Preface


First and foremost, I would like to express my deep gratitude to the distinguished
authors of this volume, to Joel Greenhouse and Rick Vitale, the Editors of the IMS
Lecture Notes and Monograph Series, to Elyse Gustafson, to the referees of the
papers, and to the people at Mattson Publishing Services and VTEX who worked
dedicatedly and diligently in order to make this a reality. This book is a team effort.
I must express my very special thanks to one person: Geri Mattson. I could write a
poetic paragraph to thank her; let me simply say she is unreal. I am also thankful
to Kristy Brewer, Rebekah Holmes, Teena Seele and Norma Lucas for editorial
support at Purdue University. They were smilingly helpful at all times. Finally,
I thank my teachers and friends T. Krishnan, B.V. Rao, Jim Pitman and Jenifer
Brown for their help and support.

The quality of the papers in this Festschrift, to my pride, joy, and satisfaction, is 
very high. I looked through every paper in this volume. A large number of the papers 
are very original. Some open a window to a major area through a well presented 
review. The articles reflect the main themes Herman Rubin has contributed to 
over half a century. I am so thankful that the authors gave quality papers; it was
magnanimous. 

When, in March, 2003, I approached the IMS with a proposal for a Festschrift for
Herman Rubin, two emotions were playing in my brain. I have had an affectionate
relationship with Herman for about a quarter century now. But I was also mindful
that the man is a rare scholar, in the finest tradition of that word. Herman could
have had a few hundred more papers if he had insisted on getting credit for all he
did for the rest of us, without ever asking for it. Now that’s unselfish. I am honored
and I am delighted that the IMS Festschrift for Herman, a fully deserved honor,
even belated, is now in print. I speak for a community when I wish Herman a happy,
healthy, and intellectually fulfilling long life. Herman, as Herman Chernoff said, is
indeed a national treasure.


Anirban DasGupta, 
Purdue University 

Editor

Some reminiscences of my friendship
with Herman Rubin

Herman Chernoff

Harvard University 


I first met Herman Rubin in 1947 while I was writing a dissertation in absentia
at Columbia University and he was a Fellow at the Institute of Advanced Study
at Princeton. I had recently been appointed as a research assistant at the Cowles
Commission for Research in Economics which was then located at the University
of Chicago. Herman had completed or almost completed his dissertation at the
University of Chicago, and we were to be colleagues at the Cowles Commission
from June 1948 to September 1949.

While I was at Columbia, I was supposed to investigate the possibility of inverting
large matrices by computer, because the method used by the Cowles Commission
for estimating the parameters of the Economy, by maximizing a function
using a gradient method, involved the inversion of matrices. I worked at the Watson
Laboratories which were located then near Columbia and had use of a “Relay
Calculator” which could be programmed (with plug boards) to multiply matrices.
With the use of the Relay calculator and a card sorter and lots of fancy footwork,
it was possible to do the job. At that time the engineers at Watson were beginning
to build the electronic computer which was to become one of the bases for the
future development of the IBM computers to follow. But I did not have access to
that machine. However, I did have access to Herman Rubin who came around to
kibitz, and to do some of the fancy footwork. At one point the sorter decided to
put the cards with the digit 4 into the box for the digit 7. We counterattacked by
instructing the 7 to go into the reject box. That scheme worked for a while, but the
sorter replied by putting the 3 into the reject box. I think that we ended up doing
some of the card sorting by hand.

At Cowles we had adjacent offices which was not exactly a blessing because
Herman had a bad habit. He would come in to the office about 7 AM, pound his 
calculator (electric and not electronic) for an hour and then prove a few theorems 
for an hour, and then was ready to discuss important matters with me when I came 
to work. These important matters were usually how to handle certain bridge hands. 
Whatever I suggested was usually wrong. That did not bother me as much as the 
time I had to spend on bridge, a game that I never properly mastered. 

I had a few friends in the Mathematics Department at the University. One of 
them, who had become a long term fixture, related to me how he had thought he 
was very smart (IQ about 180) when he was an undergraduate, until this little high 
school kid showed up, and obviously was more capable than most of the graduate 
students. Needless to say, that enfant terrible was our Herman Rubin.

While we were at Cowles we coauthored a paper, the main object of which was 
to show that even when not all of the standard conditions were satisfied, large 
sample theory demonstrated that we could still have confidence in our conclusions.
I must admit that my contributions to this effort were only to translate
Herman's work into comprehensible English, and to insist on the admittedly improper
use of the word “blithely” to indicate that we could proceed as though all
was well.

On another occasion L. J. Savage announced that he had resolved Wald’s dilemma
in using the Minimax principle, by claiming that what Wald had really meant was
“Minimax Regret”. In illustrating this principle in a class I was teaching, I discovered
that not only could Minimax lead to poor choices, but Minimax Regret
violated the principle of “Independence of Irrelevant Alternatives”, a principle that
had recently been enunciated in Arrow’s outstanding result on the Welfare Function.
When I confronted Savage with this problem, he first denied that it was serious,
but after some discussion, indicated that maybe we should follow up on recent work
by De Finetti proposing the Bayesian approach.

In fact I laid out the axioms that I felt should be satisfied by an objective
method of rational decision making. The current terminology is “coherent”. My
results were sort of negative and later published in Econometrica after I let them
simmer for a few years. The only thing that almost made sense was that if we
neglected one of the axioms, then the rational way to proceed is to treat each
unknown state of nature as equally likely. This was an unsatisfactory result for
those hoping for an objective way to do inductive inference. In the meantime both
Savage and Rubin pursued the Bayesian approach. Savage later became the high
priest of the Bayesian revolution. But no one seemed to notice that two days after
the discussion with Savage, Rubin wrote a discussion paper deriving the Bayesian
solution. What was special about this paper was that by omitting unnecessary
verbiage, it was about three pages long, and was, unlike most of Herman’s writing,
eminently readable. Unfortunately, a copy of this paper which I treasured for
many years disappeared in my files, and as far as I know, no copy of it exists
today.

I recall going to a seminar in the Mathematics Department, where I confess that 
I did not understand the lecture. At the end someone asked the speaker whether his
results could be generalized to another case. The speaker said that he had thought 
about it, but was not clear about how to proceed. Herman spoke up, indicating 
that it was perfectly clear, and explained exactly how the generalization would 
go. This was one of many examples where it was apparent that Herman could 
instantly absorb results that were presented to him, and even see further nontrivial 
consequences of these results. I envied this clarity of thought because my own
thinking processes tend to be much more confused and usually some time is needed
for me to get things straightened out.

In 1949, Rubin and Arrow left the Cowles Commission to go to Stanford. Rubin
joined the new Statistics Department organized by Albert Bowker with the help
of Abraham Girshick. Arrow was joint in Statistics and Economics. I went to the
University of Illinois, and was invited to visit Stanford for a semester two years
later. I found the department to be an exciting place to be, partly because of the
distinguished visitors which included David Blackwell and partly because of the
presence of ONR funding for applied and theoretical programs. Herman was teaching
courses in measure theory and topology, because the Mathematics Department
was busy with other topics and he felt that Statistics students should at least have
those basics.

While I was there, Girshick once was teasing Herman about the fact that the 
news indicated that an African American had just received his Ph.D. at age 18, 
and Herman had not gotten his degree until he was 19. Herman, taking this teasing 
seriously, complained that he had spent a year in the Army.

That semester, two topics that arose from the ONR project gave rise to two
papers that I wrote and of which I was very proud. They pointed to a direction
in optimal experimental design on which I spent much time later. Part of one of
those papers involved finding asymptotic upper and lower bounds on the probability
that the mean of a sample of independent identically distributed random variables
would exceed a certain constant. This paper represented the first application of large
deviation theory to a statistical problem. Cramér had derived a much more elegant
result in 1938, of which I had been ignorant. My result, involving the infimum of
a moment generating function, was less elegant and less general than the Cramér
result, but did not require a special condition that Cramér required. Also, my proof
could be described as crudely beating the problem to death. Herman claimed that
he could get a lower bound much more easily. I challenged him, and he produced a short
Chebyshev Inequality type proof, which was so trivial that I did not trouble to cite
his contribution.

What a mistake! It seems that Shannon had incorrectly applied the Central 
Limit theorem to the far tails of the distribution in one of his papers on Information 
theory. When his error was pointed out, he discovered the lower bound of Rubin in 
my paper and rescued his results. As a result I have gained great fame in electrical 
engineering circles for the Chernoff bound which was really due to Herman. One
consequence of the simplicity of the proof was that no one ever bothered to read the
original paper of which I was very proud. For years they referred to Rubin’s bound
as the Chernov bound, not even spelling my name correctly. I once had the pleasure 
of writing to a friend who sent me a copy of a paper improving on the Chernov 
bound, that I was happy that my name was not associated with such a crummy 
bound. For many years, electrical engineers have come to me and told me that I 
saved their lives, because they were able to describe the bound on their preliminary 
doctoral exams. Fortunately for me, my lasting fame, if any, will depend, not on 
Rubin’s bound, but on Chernoff faces. 

As I was preparing to return to the University of Illinois to finish off my year
in 1952, my wife and I had a long discussion with Herman in which he mentioned
that he had certain requirements for marriage. Evidently his proposed wife would
have to be a paragon of virtues, beautiful, brilliant, and Jewish. When we returned
to Stanford five months later, Herman had discovered this paragon and she was
willing and they were already married.

For a few years after I came to Stanford, Rubin and I had neighboring offices 
at Sequoia Hall. Frequently when I came across a problem that seemed to be one 
that must have been treated in the literature, I would approach Herman and ask 
him about it. It was not unusual for him to say that it was not yet in the literature, 
but that he had already solved that problem. He would then reach into the depths 
of the mountain of paper on his desk, and pull out his solution. Often I would 
come to him with a problem on which I was working, and suggest an approach 
that I might use. His invariable response was “That is the worst way to attack that 
problem.” This response frightened off many students and colleagues, but I found 
that if I persisted in asking why it was the worst way, he would sometimes explain 
why and sometimes admit that maybe it was a sensible approach. It required a
certain amount of stubbornness, which not everyone had, to confront Herman. But
I found that, because Herman was my neighbor, I was often saved from following 
false trails, often shown what was known, and often encouraged to pursue profitable 
directions that seemed problematic. 

The Japanese have a title of National Treasure which they assign to outstanding
artists and scholars. In my opinion, Herman Rubin, the eternal enfant terrible
of Statistics, has served as an American National Treasure by his willingness to
counsel those not too frightened to hear “That is the worst way”. As I recently
became an octogenarian, I realize that Herman is no longer the 20 year old I once 
knew, but I have no doubt that he is still intellectually a slightly matured 20 
year old who has contributed mightily to Statistics and from whom we can expect 
more.

Evaluating improper priors and the
recurrence of symmetric Markov chains: 

An overview 

Morris L. Eaton* 

University of Minnesota 

Abstract: Given a parametric statistical model, an improper prior distribution
can often be used to induce a proper posterior distribution (an inference).
This inference can then be used to solve decision problems once an action
space and loss have been specified. One way to evaluate the inference is to ask
for which estimation problems the above formal Bayes method produces
admissible estimators. The relationship of this problem to the recurrence of an
associated symmetric Markov chain is reviewed.


Appreciation 

Near the end of my graduate study at Stanford, Carl Morris and I had a conversation
which led us to ask whether or not the usual $\chi^2$-test for a point null hypothesis in a
multinomial setting was in fact a proper Bayes test. After a few months of struggle,
we eventually reduced the problem to one involving Laplace transforms. At this
point it was clear we needed help, and even clearer whose assistance we should seek
- namely Herman Rubin. Herman’s stature as a researcher, problem solver and font
of mathematical knowledge was well known to the Stanford students.

Within a few days of having the problem described to him, Herman had sketched 
an elegant solution minus a few “obvious” details that Carl and I were able to supply 
in the next month or so. This eventually led to an Eaton-Morris-Rubin publication 
in the Journal of Applied Probability. During this collaboration, I was struck with 
Herman’s willingness to share his considerable gifts with two fledgling researchers. 
In the succeeding years it has become clear to me that this is an essential part of
his many contributions to our discipline. Thank you, Herman.

1. Introduction 

This expository paper is concerned primarily with some techniques for trying to 
evaluate the formal Bayes method of solving decision problems. Given a parametric 
model and an improper prior distribution, the method has two basic steps:

1. Compute the formal posterior distribution (proper) for the parameter given
the data (assuming this exists).

2. Use the formal posterior to solve the “no data” version of the decision problem.
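
The two steps can be sketched numerically for one concrete case. This is an illustrative sketch only, not taken from the paper: it assumes a single observation $X \sim N(\theta, 1)$, the improper prior $\nu(d\theta) = d\theta$, quadratic loss, and Riemann sums on a truncated grid. All function names are hypothetical.

```python
import math

def formal_posterior(x, grid, likelihood, prior_density):
    """Step 1: normalise likelihood(x, theta) * prior_density(theta) on a grid."""
    step = grid[1] - grid[0]
    weights = [likelihood(x, t) * prior_density(t) for t in grid]
    m_x = sum(weights) * step            # marginal density m(x); must be finite
    return [w / m_x for w in weights]    # formal posterior density on the grid

def posterior_mean(phi, posterior, grid):
    """Step 2 under quadratic loss: the 'no data' solution is the posterior mean of phi."""
    step = grid[1] - grid[0]
    return sum(phi(t) * q for t, q in zip(grid, posterior)) * step

def normal_likelihood(x, theta):
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)

def flat_prior(theta):
    return 1.0                           # Lebesgue measure: an improper prior

grid = [-20.0 + 0.001 * i for i in range(40001)]   # truncated parameter grid (assumption)
x_obs = 1.3                                        # hypothetical observation
post = formal_posterior(x_obs, grid, normal_likelihood, flat_prior)

est_theta = posterior_mean(lambda t: t, post, grid)
est_prob_positive = posterior_mean(lambda t: 1.0 if t > 0 else 0.0, post, grid)
```

Under these assumptions the formal posterior is $N(x, 1)$, so `est_theta` should be close to the observation and `est_prob_positive`, the formal Bayes estimate of an indicator function, should be close to the standard normal distribution function evaluated at `x_obs`.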

This two step process produces a decision rule whose properties, both desirable and
undesirable, can be used in the assessment of the posterior distribution and hence
the improper prior. When frequentist measures of assessment are proposed,
they typically include some discussion of admissibility (or almost admissibility)
for the formal Bayes rules obtained from the posterior. However, there is a delicate
balance that arises immediately. If only a few decision problems are considered in
the assessment, then the evidence may not be very convincing that the posterior is
suitable since admissibility is, by itself, a rather weak optimality property. On the
other hand, even in simple situations with appealing improper prior distributions,
it is certainly possible that there are interesting decision problems where formal
Bayes solutions are inadmissible (for example, see Blackwell (1951), Eaton (1992,
Example 7.1), and Smith (1994)).

*School of Statistics, University of Minnesota, 224 Church Street S. E., Minneapolis, MN 55455,

One approach to the above problem that has yielded some interesting and useful
results is based on estimation problems with quadratic loss. In this case, formal
Bayes decision rules are just the posterior means of the functions to be estimated
and risk functions are expected mean squared error. Conditions for admissibility,
obtained from the Blyth-Stein method (see Blyth (1951) and Stein (1955)), involve
what is often called the integrated risk difference (IRD). In the case of quadratic loss
estimation, various techniques such as integration by parts or non-obvious applications
of the Cauchy-Schwarz inequality applied to the IRD, sometimes yield expressions
appropriate for establishing admissibility (for example, see Karlin (1958),
Stein (1959), Zidek (1970), and Brown and Hwang (1982)). These might be described
as “direct analytic techniques.”

In the past thirty years or so, two rather different connections have been discovered
that relate quadratic loss estimation problems to certain types of “recurrence
problems.” The first of these appeared in Brown (1971), who applied the Blyth-Stein
method to the problem of establishing the admissibility of an estimator of the
mean vector of a $p$-dimensional normal distribution with covariance equal to the
identity matrix. The loss function under consideration was the usual sum of squared
errors. In attempting to verify the Blyth-Stein condition for a given estimator $\delta$,
Brown showed that there corresponds a “natural” diffusion process, although this
connection is far from obvious. However, the heuristics in Section 1 of Brown’s
paper provide a great deal of insight into the argument. A basic result in Brown
(1971) is that the estimator $\delta$ is admissible iff the associated diffusion is recurrent.
This result depends on some regularity conditions on the risk function of $\delta$, but
holds in full generality when the risk function of $\delta$ is bounded. The arguments in
Brown’s paper depend to some extent on the underlying multivariate normal sampling
model. Srinivasan (1981) contains material related to Brown (1971). The basic
approach in Brown has been extended to the Poisson case in Johnstone (1984, 1986)
where the diffusion is replaced by a birth and death process. A common feature of
the normal and Poisson problems is that the associated continuous time stochastic
processes, whose recurrence implies admissibility, are defined on the sample space
(as opposed to the parameter space) of the estimation problem. In addition, the
inference problems under consideration are the estimation of the “natural” parameters
of the model. Brown (1979) describes some general methods for establishing
admissibility of estimators. These methods are based on the ideas underlying the
admissibility-recurrence connection described above.

Formal Bayes methods are the focus of this paper. Since the posterior distribution
is the basic inferential object in Bayesian analysis, it seems rather natural
that evaluative criteria will involve this distribution in both proper and improper
prior contexts. As in Brown (1971), just why “recurrence problems” arise in this
context is far from clear. Briefly, the connection results from using admissibility in
quadratic loss estimation problems to assess the suitability of the posterior distribution.
In particular, if the posterior distribution of $\theta$ given the data $x$ is $Q(d\theta \mid x)$
(depending, of course, on a model and an improper prior), then the formal Bayes
estimator of any bounded function of $\theta$, say $\phi(\theta)$, is the posterior mean of $\phi(\theta)$, say

$$\hat{\phi}(x) = \int \phi(\theta)\, Q(d\theta \mid x).$$

It was argued in Eaton (1982, 1992) that the “admissibility” of $\hat{\phi}$ for all bounded
$\phi$ constituted plausible evidence that the formal posterior might be suitable for
making inferences about $\theta$. To connect the admissibility of $\hat{\phi}$ to recurrence, first
observe that when $\phi_A(\theta) = I_A(\theta)$ is an indicator function of a subset $A$ of the
parameter space, then the formal Bayes estimator

$$\hat{\phi}_A(x) = Q(A \mid x)$$

is the posterior probability of $A$. If $\eta$ denotes the “true value of the model parameter”
from which $X$ was sampled, then the expected value (under the model) of the
estimator $Q(A \mid X)$ is

$$R(A \mid \eta) = E_\eta\, Q(A \mid X). \tag{1.1}$$

Next, observe that $R$ in (1.1) is a transition function defined on the parameter space
$\Theta$ of the problem. Thus, $R$ induces a discrete time Markov chain whose state space
is $\Theta$. The remainder of this paper is devoted to a discussion of the following result.

Theorem 1.1. If the Markov chain on $\Theta$ defined by $R$ in (1.1) is “recurrent,” then
$\hat{\phi}$ is “admissible” for each bounded measurable $\phi$ when the loss is quadratic.

Because $\Theta$ is allowed to be rather general, the recurrence of the Markov chain
has to be defined rather carefully - this is the reason for the quotes on recurrent.
As in Brown (1971), what connects the decision theoretic aspects of the problem
to the Markov chain is the Blyth-Stein technique and this yields what is often
called “almost admissibility.” Thus, the quotes on admissibility.
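
A single transition of the chain with transition function $R$ in (1.1) can be simulated in two stages: draw $X$ from the model at the current state $\eta$, then draw the next state from the formal posterior $Q(\cdot \mid X)$. The sketch below is an illustration under stated assumptions, not code from the paper: it takes one observation $X \sim N(\theta, 1)$ with the flat prior $\nu(d\theta) = d\theta$, for which the formal posterior is $N(x, 1)$. The function names are hypothetical.

```python
import random

def chain_step(eta, rng):
    """One transition of R(. | eta): draw X from the model, then theta from Q(. | X)."""
    x = rng.gauss(eta, 1.0)      # X | eta ~ N(eta, 1)
    return rng.gauss(x, 1.0)     # theta | X = x ~ N(x, 1), the flat-prior formal posterior

def simulate(n_steps, start=0.0, seed=0):
    """Run the chain for n_steps transitions and return the whole path."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(chain_step(path[-1], rng))
    return path

path = simulate(20000)
increments = [b - a for a, b in zip(path, path[1:])]
mean_inc = sum(increments) / len(increments)
var_inc = sum((d - mean_inc) ** 2 for d in increments) / len(increments)
visits_near_origin = sum(1 for t in path[1:] if abs(t) <= 1.0)
```

In this special case each increment is the sum of two independent standard normal draws, so the chain is a random walk on the real line with mean-zero increments of variance 2. The count `visits_near_origin` is only an informal look at how often a finite path comes back to a bounded set; it is not a proof of recurrence.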

The main goal of this paper is to explain why Theorem 1.1 is correct by examining
the argument used to prove the result. The starting point of the argument
is that the Blyth-Stein condition that involves the IRD provides a sufficient condition
for admissibility. Because this condition is somewhat hard to verify directly,
it is often the case that a simpler condition is provided via an application of the
Cauchy-Schwarz inequality. In the development here, this path leads rather naturally
to a mathematical object called a Dirichlet form. Now, the connection between
the resulting Dirichlet form, the associated chain with the transition function $R$ in
Theorem 1.1, and the recurrence of the chain is fairly easy to explain.

In brief, this paper is organized as follows. In Section 2, the Blyth-Stein condition
is described and the basic inequality that leads to the associated Dirichlet form
is presented. In Section 3 the background material (mainly from the Appendix in
Eaton (1992)) that relates the Markov chain to the Dirichlet form is described. The
conclusion of Theorem 1.1 is immediate once the connections above are established.

The application of Theorem 1.1 in particular examples is typically not easy
- primarily because establishing the recurrence of a non-trivial Markov chain is
not easy. Examples related to the Pitman estimator of a translation parameter
are discussed in Section 4. The fact that the Chung-Fuchs (1951) Theorem is used
here supports the contention that interesting examples are not routine applications
of general theory. Also in Section 4, a recent result of Lai (1996) concerning the
multivariate normal translation model is described.

A detailed proof of Theorem 3.2 is given in an appendix to this paper. The
conclusion of Theorem 3.2 is hinted at in Eaton (1992), but a rigorous proof is
rather more involved than I originally believed it would be. Thus the careful proof
here.

Although the Markov chain of interest here has the parameter space as its state
space, some interesting related work of Hobert and Robert (1999) uses a related
chain on the sample space in some examples where the two spaces are both subsets
of the real line.

2. The Blyth-Stein condition 

Here are some basic assumptions that are to hold throughout this paper. The sample
space $\mathcal{X}$ and the parameter space $\Theta$ are both Polish spaces with their respective
$\sigma$-algebras of Borel sets. All functions are assumed to be appropriately measurable.
The statistical model for $X \in \mathcal{X}$ is $\{P(\cdot \mid \theta) \mid \theta \in \Theta\}$ and the improper prior
distribution $\nu$ is assumed to be $\sigma$-finite on the Borel sets of $\Theta$. The marginal measure $M$
on $\mathcal{X}$ is defined by

$$M(B) = \int_\Theta P(B \mid \theta)\, \nu(d\theta). \tag{2.1}$$

Because $\nu(\Theta) = +\infty$, it is clear that $M(\mathcal{X}) = +\infty$. However, in some interesting
examples, the measure $M$ is not $\sigma$-finite and this prevents the existence of a formal
posterior distribution. (For example, take $\mathcal{X} = \{0, 1, \ldots, m\}$, the Binomial$(m, \theta)$ model,
and $\nu(d\theta) = [\theta(1-\theta)]^{-1}\, d\theta$ on $(0,1)$. No formal posterior exists here.) In
all that follows the measure $M$ is assumed to be $\sigma$-finite. In this case, there exists
a proper conditional distribution $Q(d\theta \mid x)$ for $\theta$ given $X = x$ which satisfies

$$P(dx \mid \theta)\, \nu(d\theta) = Q(d\theta \mid x)\, M(dx). \tag{2.2}$$

Equation (2.2) means that the two joint measures on $\mathcal{X} \times \Theta$ agree. Further, $Q(\cdot \mid x)$
is unique almost everywhere $M$. For more discussion of this, see Johnson (1991).
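
To see why the Binomial example above has no formal posterior, a short calculation (added here as an illustration, assuming $m \ge 1$) evaluates the marginal mass at the two endpoints of the sample space:

```latex
\begin{aligned}
M(\{0\}) &= \int_0^1 (1-\theta)^{m}\,[\theta(1-\theta)]^{-1}\,d\theta
          = \int_0^1 \theta^{-1}(1-\theta)^{m-1}\,d\theta = +\infty, \\
M(\{m\}) &= \int_0^1 \theta^{m}\,[\theta(1-\theta)]^{-1}\,d\theta
          = \int_0^1 \theta^{m-1}(1-\theta)^{-1}\,d\theta = +\infty .
\end{aligned}
```

The first integral diverges because of the singularity at $\theta = 0$ and the second because of the singularity at $\theta = 1$. A single sample point carrying infinite mass cannot be covered by sets of finite $M$-measure, so $M$ is not $\sigma$-finite. For $0 < x < m$ the mass $M(\{x\}) = \binom{m}{x} \int_0^1 \theta^{x-1}(1-\theta)^{m-x-1}\,d\theta$ is finite.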

Given the formal posterior, $Q(\cdot \mid x)$, the formal Bayes estimator for any bounded
function $\phi(\theta)$ when the loss is quadratic is the posterior mean

$$\hat{\phi}(x) = \int \phi(\theta)\, Q(d\theta \mid x). \tag{2.3}$$

The risk function of this estimator is

$$R(\hat{\phi}, \theta) = E_\theta [\hat{\phi}(X) - \phi(\theta)]^2 \tag{2.4}$$

where $E_\theta$ denotes expectation with respect to the model. Because $\phi$ is bounded,
$\hat{\phi}$ exists and $R(\hat{\phi}, \theta)$ is a bounded function of $\theta$. The bounded assumption on $\phi$
simplifies the discussion enormously and allows one to focus on the essentials of the
admissibility-recurrence connection. For a version of this material that is general
enough to handle the estimation of unbounded $\phi$’s, see Eaton (2001).

The appropriate notion of “admissibility” for our discussion here is captured in
the following definition due to C. Stein.

Definition 2.1. The estimator $\hat{\phi}$ is almost-$\nu$-admissible if for any other estimator
$t(X)$ that satisfies

$$R(t, \theta) \le R(\hat{\phi}, \theta) \quad \text{for all } \theta, \tag{2.5}$$

the set

$$B = \{\theta \mid R(t, \theta) < R(\hat{\phi}, \theta)\} \tag{2.6}$$

has $\nu$-measure zero.

Definition 2.2. The formal posterior $Q(\cdot \mid x)$ is strongly admissible if the estimator
$\hat{\phi}$ is almost-$\nu$-admissible for every bounded function $\phi$.

The notion of strong admissibility is intended to capture a robustness property
of the formal Bayes method across problems - at least for quadratic loss estimation
problems when $\phi$ is bounded. The soft argument is that $Q(\cdot \mid x)$ cannot be too badly
behaved if $\hat{\phi}$ is almost-$\nu$-admissible for all bounded $\phi$.

To describe a convenient version of the Blyth-Stein conditions for almost-$\nu$-admissibility,
consider a bounded function $g \ge 0$ defined on $\Theta$ and satisfying

$$0 < \int g(\theta)\, \nu(d\theta) = c < +\infty. \tag{2.7}$$

Now $\nu_g(d\theta) = g(\theta)\, \nu(d\theta)$ is a finite measure on $\Theta$ so we can write

$$P(dx \mid \theta)\, \nu_g(d\theta) = Q_g(d\theta \mid x)\, M_g(dx) \tag{2.8}$$

where $M_g$ is the marginal measure defined by

$$M_g(dx) = \int P(dx \mid \theta)\, \nu_g(d\theta). \tag{2.9}$$

Of course, $Q_g(d\theta \mid x)$ is a version of the conditional distribution of $\theta$ given $X = x$
when the proper prior distribution of $\theta$ is $c^{-1} \nu_g$. Setting

$$\hat{g}(x) = \int g(\theta)\, Q(d\theta \mid x), \tag{2.10}$$

it is not hard to show that

$$M_g(dx) = \hat{g}(x)\, M(dx). \tag{2.11}$$

Since the set $\{x \mid \hat{g}(x) = 0\}$ has $M_g$-measure zero, it follows that a version of $Q_g(d\theta \mid x)$
is

$$Q_g(d\theta \mid x) =
\begin{cases}
\dfrac{g(\theta)}{\hat{g}(x)}\, Q(d\theta \mid x), & \text{if } \hat{g}(x) > 0, \\[1ex]
Q(d\theta \mid x), & \text{if } \hat{g}(x) = 0.
\end{cases} \tag{2.12}$$

In all that follows, (2.12) is used as the conditional distribution of $\theta$ given $X = x$
when the prior distribution is $\nu_g$.
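
The relations (2.8) through (2.12) are purely algebraic, so they can be checked on a small discrete model. The following sketch is an illustration only: the three-point parameter space, the four-point sample space and every number in it are hypothetical, and the weights `nu` are finite here, unlike an improper prior.

```python
# Toy finite model (all numbers hypothetical): 3 parameter values, 4 sample points.
P = [[0.5, 0.3, 0.1, 0.1],    # P[theta][x] = P(x | theta); each row sums to 1
     [0.2, 0.4, 0.3, 0.1],
     [0.1, 0.2, 0.3, 0.4]]
nu = [1.0, 2.0, 0.5]           # prior weights nu({theta})
g = [1.0, 0.0, 3.0]            # a bounded function g >= 0 on the parameter space

n_theta, n_x = len(P), len(P[0])

def marginal(weights):
    """M(x) = sum over theta of P(x | theta) * weights(theta), as in (2.1) and (2.9)."""
    return [sum(P[t][x] * weights[t] for t in range(n_theta)) for x in range(n_x)]

def posterior(weights):
    """Q[x][theta] solving P(x | theta) * weights(theta) = Q(theta | x) * M(x)."""
    m = marginal(weights)
    return [[P[t][x] * weights[t] / m[x] for t in range(n_theta)] for x in range(n_x)]

M = marginal(nu)
Q = posterior(nu)
nu_g = [g[t] * nu[t] for t in range(n_theta)]
M_g = marginal(nu_g)
Q_g_direct = posterior(nu_g)                       # posterior under the prior nu_g

g_hat = [sum(g[t] * Q[x][t] for t in range(n_theta)) for x in range(n_x)]    # (2.10)
M_g_via_211 = [g_hat[x] * M[x] for x in range(n_x)]                          # (2.11)
# First branch of (2.12); g_hat is strictly positive everywhere in this toy model.
Q_g_via_212 = [[g[t] * Q[x][t] / g_hat[x] for t in range(n_theta)] for x in range(n_x)]
```

Comparing `M_g` with `M_g_via_211`, and `Q_g_direct` with `Q_g_via_212`, confirms (2.11) and the first branch of (2.12) up to floating point error in this toy setting.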

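Now, the Bayes estimator for $\phi(\theta)$, given the posterior (2.12), is

$$\hat{\phi}_g(x) = \int \phi(\theta)\, Q_g(d\theta \mid x) \tag{2.13}$$

whose risk function is

$$R(\hat{\phi}_g, \theta) = E_\theta [\hat{\phi}_g(X) - \phi(\theta)]^2. \tag{2.14}$$

The so called Integrated Risk Difference,

$$\mathrm{IRD}(g) = \int [R(\hat{\phi}, \theta) - R(\hat{\phi}_g, \theta)]\, g(\theta)\, \nu(d\theta), \tag{2.15}$$

plays a key role in the Blyth-Stein condition for the almost-$\nu$-admissibility of $\hat{\phi}$.
To describe this condition, consider a measurable set $C \subseteq \Theta$ with $0 < \nu(C) < +\infty$
and let

$$U(C) = \Big\{ g \ge 0 \;\Big|\; g \text{ is bounded, } g(\theta) \ge 1 \text{ for } \theta \in C, \text{ and } \int g(\theta)\, \nu(d\theta) < +\infty \Big\}. \tag{2.16}$$

Theorem 2.1 (Blyth-Stein). Assume there is a sequence of sets $C_i \subseteq C_{i+1} \subseteq \Theta$,
$i = 1, 2, \ldots$, with $0 < \nu(C_i) < +\infty$ and $C_i \nearrow \Theta$ so that

$$\inf_{g \in U(C_i)} \mathrm{IRD}(g) = 0 \quad \text{for } i = 1, 2, \ldots \tag{2.17}$$

Then $\hat{\phi}$ is almost-$\nu$-admissible.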

The proof of this well known result is not repeated here. The usual interpretation
of Theorem 2.1 is that when $\hat{\phi}$ is “close enough” to a proper Bayes rule, then $\hat{\phi}$
is almost-$\nu$-admissible, but the notion of closeness is at best rather vague.

A possible first step in trying to apply Theorem 2.1 is to find a tractable (and
fairly sharp) upper bound for $\mathrm{IRD}(g)$ in (2.15). Here is the key inequality that
allows one to see eventually why “recurrence” implies strong admissibility.

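Theorem 2.2. For a real valued measurable function $h$ defined on $\Theta$, let

$$\Delta(h) = \iiint (h(\theta) - h(\eta))^2\, Q(d\theta \mid x)\, Q(d\eta \mid x)\, M(dx). \tag{2.18}$$

Then for each bounded function $\phi$, there is a constant $K_\phi$ so that

$$\mathrm{IRD}(g) \le K_\phi\, \Delta(\sqrt{g}), \tag{2.19}$$

for all bounded non-negative $g$ satisfying $\int g(\theta)\, \nu(d\theta) < +\infty$.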

Proof. A dirt^ct proof of (2.19) using the Cauchy -Schwarz Inequality follows. First, 
let A = (a:|. 9 (x) > 0} and recall that A^ has Mg measure zero. Thus, 

IRD{g) = |^(0(x) - <p{6)f - (^g{x) - 0(^))''j P{dx\e)g{0)u{d0) 

^0(x) - 0(61)) - (J>g{x) - 0(^)) I Qg(de\x)Mg{dx) 


= LL [(' 

= - 0»(x)) g(x)M{dx) 

^ L " f(^) \{x)M{dx) 


( 2 . 20 ) 


M{dx). 


A bit of algebra shows that for each x, 

m Ui(0)-!i{x))Q{M\x) 

~ !'('')) Q(M\x)Q(dri\x). 

Using the non-negativity of g and the Cauchy-Schwarz inequality we have 

(0(0) - 0(r/)) (.9(0) - g{7])) Q{de\x)Q{dT]\x) 


L 


III' 


< H’(x-) 


jj [\/Wi - \/^) Q{de\x)Q{di]\x) 
where

$$W^2(x) = \int\!\!\int (\phi(\theta) - \phi(\eta))^2 \big(\sqrt{g(\theta)} + \sqrt{g(\eta)}\big)^2 Q(d\theta|x)\, Q(d\eta|x).$$

Since $\phi$ is bounded, say $|\phi(\theta)| \le c_0$, and since $\big(\sqrt{g(\theta)} + \sqrt{g(\eta)}\big)^2 \le 2(g(\theta) + g(\eta))$, we have

$$W^2(x) \le 16 c_0^2\, \hat g(x).$$

Substituting these bounds into the final expression in (2.20) yields

$$\mathrm{IRD}(g) \le 4 c_0^2 \int_A\!\int\!\!\int \big(\sqrt{g(\theta)} - \sqrt{g(\eta)}\big)^2 Q(d\theta|x)\, Q(d\eta|x)\, M(dx) \le 4 c_0^2\, \Delta(\sqrt{g}).$$

Setting the constant in (2.19) equal to $4c_0^2$ yields the result. □

Combining Theorem 2.1 and Theorem 2.2 gives the main result of this section.

Theorem 2.3. Assume there is a sequence of increasing sets $C_i \subseteq \Theta$, $i = 1, 2, \ldots$, with $0 < \nu(C_i) < +\infty$ and $C_i \nearrow \Theta$ so that

$$\inf_{g \in U(C_i)} \Delta(\sqrt{g}) = 0, \quad \text{for each } i. \qquad (2.21)$$

Then $Q(d\theta|x)$ is strongly admissible.

Proof. When (2.21) holds, inequality (2.19) shows that (2.17) holds for each bounded measurable $\phi$. Then $Q(d\theta|x)$ is strongly admissible. □

It should be noted that the assumption (2.21) does not involve $\phi$ (as opposed to assumption (2.17)). Thus the conditions for strong admissibility involve the behavior of $\Delta$. It is exactly the functional $\Delta$ that provides the connection between (2.21) and the "recurrence" of the Markov chain with transition function $R$ in (1.1).

To put the material of the next section in perspective, it is now useful to isolate some of the essential features of the decision theory problem described above - namely, under what conditions on the given model $P(dx|\theta)$ and the improper prior $\nu(d\theta)$ will the formal posterior $Q(d\theta|x)$ be strongly admissible? A basic ingredient in our discussion will be the transition function

$$R(d\theta|\eta) = \int_{\mathcal{X}} Q(d\theta|x)\, P(dx|\eta) \qquad (2.22)$$

introduced in Section 1. A fundamental property of $R$ is its symmetry with respect to $\nu$ - that is, the measure on $\Theta \times \Theta$ defined by

$$s(d\theta, d\eta) = R(d\theta|\eta)\,\nu(d\eta) \qquad (2.23)$$

is a symmetric measure in the sense that

$$s(A \times B) = s(B \times A) \qquad (2.24)$$

for Borel subsets $A$ and $B$ of $\Theta$. This is easily established from the definition of $R$. It
is this symmetry that drives the theory of the next section and allows us to connect 
the behavior of $\Delta$, namely

$$\Delta(h) = \int\!\!\int (h(\theta) - h(\eta))^2\, R(d\theta|\eta)\,\nu(d\eta), \qquad (2.25)$$

to the "recurrence" of the Markov chain defined by $R$. The expression (2.25) for $\Delta$ follows from (2.18) and the disintegration formula (2.2).

Also, note that $\nu$ is a stationary measure for $R$ - that is,

$$\int R(A|\eta)\,\nu(d\eta) = \nu(A) \qquad (2.26)$$

for all Borel sets $A$. This is an easy consequence of the symmetry of $s$ in (2.23).
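
The step from symmetry to stationarity can be written out in one line. The sketch below assumes only that $R(\cdot|\eta)$ is a probability measure for each $\eta$, so that $R(\Theta|\eta) = 1$, and uses the symmetric measure $s$ with $B = \Theta$:

```latex
\nu(A) = \int_A R(\Theta|\eta)\,\nu(d\eta)
       = s(\Theta \times A)
       = s(A \times \Theta)
       = \int_\Theta R(A|\eta)\,\nu(d\eta).
```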

The discussion in the next section begins with an abstraction of the above observations. Much of the discussion is based on the Appendix in Eaton (1992).

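Here is the standard Pitman example, which gives a concrete non-trivial illustration of what the above formulation yields.

Example 2.1. Consider $X_1, \ldots, X_n$ that are independent and identically distributed random vectors in $R^p$ with a density $f(x - \theta)$ (with respect to Lebesgue measure). Thus $\Theta = R^p$ and the model for $X = (X_1, \ldots, X_n)$ is

$$P(dx|\theta) = \prod_{i=1}^{n} f(x_i - \theta)\, dx_i$$

on the sample space $\mathcal{X} = R^{pn}$. With $dx$ as Lebesgue measure on $\mathcal{X}$, the density of $P(dx|\theta)$ with respect to $dx$ is

$$p(x|\theta) = \prod_{i=1}^{n} f(x_i - \theta).$$

Next take $\nu(d\theta) = d\theta$ on $\Theta = R^p$ and assume, for simplicity, that

$$m(x) = \int_{R^p} p(x|\theta)\, d\theta$$

is in $(0, \infty)$ for all $x$. Then a version of "$Q(d\theta|x)$" is

$$Q(d\theta|x) = \frac{p(x|\theta)}{m(x)}\, d\theta.$$

Thus the transition function $R$ is given by

$$R(d\theta|\eta) = \left[\int_{\mathcal{X}} \frac{p(x|\theta)\, p(x|\eta)}{m(x)}\, dx\right] d\theta.$$

Therefore,

$$R(d\theta|\eta) = r(\theta|\eta)\, d\theta$$

where the density $r(\cdot|\eta)$ is

$$r(\theta|\eta) = \int_{\mathcal{X}} \frac{p(x|\theta)\, p(x|\eta)}{m(x)}\, dx.$$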
Now, it is easy to show that for each vector $u \in R^p$,

$$r(\theta + u|\eta + u) = r(\theta|\eta)$$

so that $r$ is only a function of $\theta - \eta$, say

$$t(\theta - \eta) = r(\theta - \eta|0).$$

Further routine calculations give

$$t(u) = t(-u) \ \text{ for } u \in R^p, \qquad \int t(u)\, du = 1.$$

In summary then, for the translation model with $d\theta$ as the improper prior distribution, the induced transition function is

$$R(d\theta|\eta) = t(\theta - \eta)\, d\theta$$

and $t$ is a symmetric density function on $R^p$. We will return to this example later.
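
As a numerical sanity check of these properties, here is a minimal Python sketch for the simplest special case, chosen purely for illustration: $n = 1$, $p = 1$ and $f$ the standard normal density. Under these assumptions $m(x) = 1$, so $t(u) = \int f(x - u) f(x)\, dx$, which is the $N(0, 2)$ density. The function names and the quadrature grid are hypothetical choices, not part of the original text.

```python
import math

def f(x):
    """Standard normal density (an illustrative choice of f)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def t(u, lo=-12.0, hi=12.0, steps=2400):
    """t(u) = integral of f(x - u) f(x) dx via the trapezoid rule (n = 1, p = 1)."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * f(x - u) * f(x)
    return total * dx

def n02_density(u):
    """Density of N(0, 2), the closed form of t in this special case."""
    return math.exp(-u * u / 4.0) / math.sqrt(4.0 * math.pi)

us = [-3.0 + 0.25 * k for k in range(25)]
max_asymmetry = max(abs(t(u) - t(-u)) for u in us)       # symmetry: t(u) = t(-u)
max_error = max(abs(t(u) - n02_density(u)) for u in us)  # agreement with N(0, 2)

# Total mass of t over a wide grid (Riemann sum with step 0.1).
grid = [-10.0 + 0.1 * k for k in range(201)]
mass = sum(t(u) for u in grid) * 0.1
```

In this special case the induced chain is a Gaussian random walk with $N(0, 2)$ increments; other choices of $f$ or $n$ would need $m(x)$ computed as well.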
3. Symmetric Markov chains 

Here, a brief sketch of symmetric Markov chain theory, recurrence and Dirichlet forms is given. The purpose of this section is two-fold - first to explain the relationship between recurrence and the Dirichlet form and second to relate this to the strong admissibility result of Theorem 2.3.

Let $Y$ be a Polish space with the Borel $\sigma$-algebra $\mathcal{B}$ and consider a Markov kernel $R(dy|z)$ on $\mathcal{B} \times Y$. Also let $\lambda$ be a non-zero $\sigma$-finite measure on $\mathcal{B}$.

Definition 3.1. The kernel $R(dy|z)$ is $\lambda$-symmetric if the measure

$$s(dy, dz) = R(dy|z)\,\lambda(dz) \qquad (3.1)$$

is a symmetric measure on $\mathcal{B} \times \mathcal{B}$.

Typically, $R$ is called symmetric without reference to $\lambda$ since $\lambda$ is fixed in most discussions. As the construction in Section 2 shows, interesting examples of symmetric kernels abound in statistical decision theory. In all that follows, it is assumed that $R$ is $\lambda$-symmetric. Note that the assumption of $\sigma$-finiteness for $\lambda$ is important. Given a $\lambda$-symmetric $R$, consider a real valued measurable function $h$ and let

$$\Delta(h) = \int\!\!\int (h(y) - h(z))^2\, R(dy|z)\,\lambda(dz). \qquad (3.2)$$

The quadratic form $\Delta$ (or sometimes $\frac{1}{2}\Delta$) is often called a Dirichlet form. Such forms are intimately connected with continuous time Markov Process Theory (see Fukushima et al (1994)) and also have played a role in some work on Markov chains (for example, see Diaconis and Stroock (1991)). A routine calculation using the symmetry of $R$ shows that

$$\Delta(h) \le 4 \int h^2(y)\,\lambda(dy) \qquad (3.3)$$

so $\Delta$ is finite for $h \in L_2(\lambda)$, the space of $\lambda$-square integrable functions.
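
To make the objects in (3.1)-(3.3) concrete, here is a small Python sketch on a finite state space. Everything in it is an assumption made for illustration: the six-point state space, the random symmetric weights playing the role of $s$, and the names `dirichlet` and `sq_norm`. It builds $\lambda$ and $R$ from the symmetric weights and checks the bound (3.3) on random test functions.

```python
import random

random.seed(0)
n = 6  # size of the (hypothetical) finite state space

# Symmetric non-negative weights s[y][z] = s[z][y] stand in for s(dy, dz).
s = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        s[i][j] = s[j][i] = random.random()

lam = [sum(s[y][z] for y in range(n)) for z in range(n)]        # lambda({z})
R = [[s[y][z] / lam[z] for z in range(n)] for y in range(n)]    # R[y][z] = R({y} | z)

def dirichlet(h):
    """Delta(h) of (3.2): sum of (h(y) - h(z))^2 R(y | z) lambda(z)."""
    return sum((h[y] - h[z]) ** 2 * R[y][z] * lam[z]
               for y in range(n) for z in range(n))

def sq_norm(h):
    """Integral of h^2 with respect to lambda."""
    return sum(h[y] ** 2 * lam[y] for y in range(n))

# Each R(. | z) is a probability distribution.
column_sums = [sum(R[y][z] for y in range(n)) for z in range(n)]

# Ratio Delta(h) / ||h||^2 on random test functions; (3.3) says it is at most 4.
ratios = []
for _ in range(200):
    h = [random.uniform(-1.0, 1.0) for _ in range(n)]
    ratios.append(dirichlet(h) / sq_norm(h))
worst_ratio = max(ratios)
```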
Now, given $R(dy|z)$, there is a Markov chain with state space $Y$ and transition function $R(dy|z)$. More precisely, consider the path space $W = Y^\infty = Y \times Y \times \cdots$ with the usual product $\sigma$-algebra. Given the initial value $w_0$, there is a Markov chain $W = (w_0, W_1, W_2, \ldots)$ so that $R(dw_{i+1}|w_i)$ is the conditional distribution of $W_{i+1}$ given $W_i = w_i$, for $i = 0, 1, 2, \ldots$. The unique probability measure on path space that is consistent with this Markov specification is denoted by $S(\cdot|w_0)$.

Because the space $Y$ is rather general, the definition of recurrence has to be selected with some care. The reader should note that neither irreducibility nor periodicity occur in the discussion that follows (see Meyn and Tweedie (1993) for a discussion of such things in the general state space case). Let $C \subseteq Y$ satisfy $0 < \lambda(C) < +\infty$. Such measurable sets are called $\lambda$-proper. Define the random variable $T_C$ on $W$ as follows:

$$
T_C(w) = \begin{cases}
+\infty & \text{if } w_i \notin C \text{ for } i = 1, 2, \ldots \\
1 & \text{if } w_1 \in C \\
n & \text{if } w_n \in C \text{ for some } n \ge 2 \text{ and } w_i \notin C \text{ for } i = 1, \ldots, n-1
\end{cases}
\qquad (3.4)
$$

Then $T_C$ ignores the starting value of the chain and records the first hitting time of $C$ for times greater than 0. The set

$$B_C = \{T_C < +\infty\} \qquad (3.5)$$

is the event where the chain hits $C$ at some time after time 0.
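
The definition of $T_C$ can be sketched as a short Python function. This is only an illustration under an obvious limitation: code sees a finite path prefix $(w_0, \ldots, w_N)$, so a return value of infinity means "no visit to $C$ observed at times $1, \ldots, N$", not that the full path never hits $C$. The name `hitting_time` and the predicate `in_C` are hypothetical.

```python
import math

def hitting_time(path, in_C):
    """First time i >= 1 with path[i] in C, following (3.4).

    `path` is a finite prefix (w_0, w_1, ..., w_N); `in_C` is a predicate.
    The starting value w_0 is ignored. Returns math.inf when no visit to C
    occurs within the prefix.
    """
    for i, w in enumerate(path):
        if i >= 1 and in_C(w):
            return i
    return math.inf

in_zero = lambda w: w == 0
t_return = hitting_time([0, 3, 5, 0, 1], in_zero)   # starts in C, returns at time 3
t_immediate = hitting_time([0, 0], in_zero)         # w_1 in C gives T_C = 1
t_none = hitting_time([0, 1, 2], in_zero)           # no return within the prefix
```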

Definition 3.2. A $\lambda$-proper set $C \subseteq Y$ is called locally-$\lambda$-recurrent if the set

$$B_0 = \{w_0 \in C \mid S(B_C|w_0) < 1\}$$

has $\lambda$-measure zero.

Definition 3.3. A $\lambda$-proper set $C \subseteq Y$ is called $\lambda$-recurrent if the set

$$B_1 = \{w_0 \mid S(B_C|w_0) < 1\}$$

has $\lambda$-measure zero.

In other words, $C$ is locally-$\lambda$-recurrent if whenever the chain starts in $C$, it returns to $C$ w.p.1, except for a set of starting values of $\lambda$-measure zero. It is this notion of recurrence that is most relevant for admissibility considerations. Of course, $C$ is $\lambda$-recurrent if the chain hits $C$ no matter where it starts, except for a set of starting values of $\lambda$-measure zero. This second notion is closer to traditional ideas related to recurrence.

To describe the connection between the Dirichlet form $\Delta$ and local-$\lambda$-recurrence, consider

$$V(C) = \{h \mid h \in L_2(\lambda),\ h \ge 0,\ h(y) \ge 1 \text{ for } y \in C,\ h \text{ is bounded}\}. \qquad (3.6)$$

Note that $U(C)$ in (2.16) and $V(C)$ are in one-to-one correspondence via the relation $h(y) = \sqrt{g(y)}$.

Theorem 3.1. For a $\lambda$-proper set $C$,

$$\inf_{h \in V(C)} \Delta(h) = 2 \int_C (1 - S(B_C|w))\,\lambda(dw). \qquad (3.7)$$
A proof of this basic result can be found in Appendix 2 of Eaton (1992). From (3.7), it is immediately obvious that $C$ is a locally-$\lambda$-recurrent set iff the inf over $V(C)$ of the Dirichlet form $\Delta$ is zero.

Definition 3.4. The Markov chain $W = (W_0, W_1, W_2, \ldots)$ is locally-$\lambda$-recurrent if each $\lambda$-proper set $C$ is locally-$\lambda$-recurrent.

In applications, it is useful to have some conditions that imply local-$\lambda$-recurrence since the verification that every $\lambda$-proper $C$ is locally-$\lambda$-recurrent can be onerous. To this end, we have

Theorem 3.2. The following are equivalent:

(i) The Markov chain $W = (W_0, W_1, W_2, \ldots)$ is locally-$\lambda$-recurrent

(ii) There exists an increasing sequence of $\lambda$-proper sets $C_i$, $i = 1, 2, \ldots$ such that $C_i \nearrow Y$ and each $C_i$ is locally-$\lambda$-recurrent

Proof. Obviously (i) implies (ii). The converse is proved in the appendix. □

In a variety of decision theory problems, it is often sufficient to find one set $B_0$ that is "recurrent" in order to establish "admissibility." For an example of the "one-set phenomenon," see Brown and Hwang (1982). In the current Markov chain context, here is a "one-set" condition that implies local-$\lambda$-recurrence for the chain $W$.

Theorem 3.3. Suppose there exists a $\lambda$-proper set $B_0$ that is $\lambda$-recurrent (see Definition 3.3). Then the Markov chain $W$ is locally-$\lambda$-recurrent.

Proof. Since $\lambda$ is $\sigma$-finite, there is a sequence of increasing $\lambda$-proper sets $B_i$, $i = 1, 2, \ldots$ such that $B_i \nearrow Y$. Let $C_i = B_i \cup B_0$, $i = 1, 2, \ldots$ so the sets $C_i$ are $\lambda$-proper, are increasing, and $C_i \nearrow Y$. The first claim is that each $C_i$ is locally-$\lambda$-recurrent. To see this, let $N$ be the $\lambda$-null set where $S(T_{B_0} < +\infty|w) < 1$. Then for $w \notin N$, the chain hits $B_0$ w.p.1 after time 0 when $W_0 = w$. Thus, for $w \notin N$, the chain hits $B_0 \cup B_i$ w.p.1 after time 0 when $W_0 = w$. Therefore the set $C_i = B_0 \cup B_i$ is locally-$\lambda$-recurrent. By Theorem 3.2, $W$ is locally-$\lambda$-recurrent. □

The application of the above results to the strong-admissibility problem is straightforward. In the context of Section 2, consider a given model $P(dx|\theta)$ and a $\sigma$-finite improper prior distribution $\nu(d\theta)$ so that the marginal measure $M$ in (2.1) is $\sigma$-finite. This allows us to define the transition $R(d\theta|\eta)$ in (2.22) that is $\nu$-symmetric. Therefore the above theory applies to the Markov chain $W = (W_0, W_1, W_2, \ldots)$ on $\Theta^\infty$ defined by $R(d\theta|\eta)$. Here is the main result that establishes "Theorem 1.1" stated in the introductory section of this paper.

Theorem 3.4. Suppose the Markov chain $W$ with state space $\Theta$ and transition function $R(d\theta|\eta)$ is locally-$\nu$-recurrent. Then the posterior distribution $Q(d\theta|x)$ defined in (2.2) is strongly admissible.

Proof. Because $W$ is locally-$\nu$-recurrent, the infimum in (3.7) is zero for each $\nu$-proper set $C$. This implies that condition (2.21) in Theorem 2.3 holds. Thus, $Q(d\theta|x)$ is strongly admissible. □

Of course Theorem 3.2 makes it a bit easier to show $W$ is locally-$\nu$-recurrent, while Theorem 3.3 provides an extremely useful sufficient condition for this property of $W$. An application is given in the next section.
4. Examples

Here we focus on two related examples. The first is based on the Pitman model introduced in Example 2.1. In this case, the induced Markov chain is a random walk on the parameter space, and as is well known, under rather mild moment conditions for dimensions $p = 1$ and $p = 2$, the random walks are recurrent. But for $p \ge 3$, there are no recurrent random walks on $R^p$ that have densities with respect to Lebesgue measure. Of course this parallels what decision theory yields for admissibility of the Pitman estimator of a mean vector - admissibility for $p = 1$ and $p = 2$ (under mild moment conditions) and inadmissibility in many examples when $p \ge 3$. The results here do not concern estimation of a mean vector, but rather involve the strong admissibility of the posterior, and again the dimension phenomenon prevails.

The second example is from the thesis of Lai (1996) and concerns the $p$-dimensional normal distribution with an unknown mean vector and the identity covariance matrix. In essence, Lai's results provide information regarding a class of improper priors that yield strong admissibility when the parameter space is $R^p$. Even in the case of the normal distribution there are many open questions concerning the "inappropriateness" of the posterior when the improper prior is $d\theta$ on $R^p$, $p \ge 3$.

Example 4.1 (continued). As shown in Section 2, the induced transition function on the parameter space $\Theta = R^p$ is given by

$$R(d\theta|\eta) = t(\theta - \eta)\, d\theta \qquad (4.1)$$

where the density function $t$ is defined in Example 2.1. The Markov chain induced by $R$ is just a random walk on $R^p$. When $p = 1$, the results of Chung and Fuchs (1951) apply directly. In particular, if $p = 1$ and

$$\int_{-\infty}^{\infty} |u|\, t(u)\, du < +\infty, \qquad (4.2)$$

then the Markov chain is recurrent and so the posterior distribution in this case is strongly admissible. A sufficient condition for (4.2) to hold is that the original density function $f$ in Example 2.1 has a finite mean (see Eaton (1992) for details).

When $p = 2$, a Chung-Fuchs-like argument also applies (see Revuz (1984)). In particular, if

$$\int_{R^2} \|u\|^2\, t(u)\, du < +\infty, \qquad (4.3)$$

then the Markov chain on $R^2$ is recurrent so strong admissibility obtains. Again, it is not too hard to show that the existence of second moments under $f$ in Example 2.1 implies that (4.3) holds. These results for $p = 1, 2$ are suggested by the work of Stein (1959) and James and Stein (1961).

For $p \ge 3$, the Markov chain obtained from $R$ in (4.1) can never be recurrent (see Guivarc'h, Keane, and Roynette (1977)) suggesting that the posterior distribution obtained from the improper prior $d\theta$ on $\Theta = R^p$ is suspect. At present, the question of "inadmissibility" of the posterior when $p \ge 3$ remains largely open. This ends our discussion of Example 2.1.
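
The dimension phenomenon can be seen numerically in a Gaussian special case. The following Python sketch is an illustration under assumptions that are not in the text: the increments of the random walk are taken to be $N(0, 2I_p)$ (the case $n = 1$ with standard normal $f$), so that $\|S_n\|^2 / (2n)$ is chi-square with $p$ degrees of freedom, and recurrence is probed through the expected number of visits to the unit ball, $\sum_n P(\|S_n\| \le 1)$, which for random walks is infinite in the recurrent case and finite in the transient case. Partial sums cannot prove divergence; they only show the growth pattern.

```python
import math

def chi2_cdf(x, p):
    """CDF of the chi-square distribution with p = 1, 2 or 3 degrees of freedom."""
    if p == 1:
        return math.erf(math.sqrt(x / 2.0))
    if p == 2:
        return -math.expm1(-x / 2.0)  # 1 - exp(-x/2), computed stably for small x
    if p == 3:
        return (math.erf(math.sqrt(x / 2.0))
                - math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("only p = 1, 2, 3 are implemented in this sketch")

def expected_visits(p, n_from, n_to):
    """Sum over n_from <= n <= n_to of P(||S_n|| <= 1), with S_n ~ N(0, 2n I_p)."""
    return sum(chi2_cdf(1.0 / (2.0 * n), p) for n in range(n_from, n_to + 1))

# Contribution of steps 1001..4000 to the expected number of visits.
growth = {p: expected_visits(p, 1001, 4000) for p in (1, 2, 3)}
```

The block for $p = 1$ contributes on the order of $\sqrt{n}$, the block for $p = 2$ contributes roughly $\frac{1}{4}\log 4$ however far out it is placed (logarithmic divergence), and the block for $p = 3$ is already negligible, in line with the terms decaying like $n^{-p/2}$.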

Example 4.2. The material in this example is based on the work of Lai (1996). Suppose $X$ is a $p$-dimensional random vector with a normal distribution $N_p(\theta, I_p)$. Here $\theta \in \Theta = R^p$ is unknown and the covariance matrix of $X$ is the $p \times p$ identity. Consider an improper prior distribution of the form

$$\nu(d\theta) = (a + \|\theta\|^2)^{\alpha}\, d\theta \qquad (4.4)$$

where the constant $a$ is positive and $\alpha$ is a real number. In this setting Lai proved the following.

Theorem 4.1 (Lai (1996)). If $\alpha \in (-\infty, -\frac{p}{2} + 1]$, then the posterior distribution for $\theta$ is strongly admissible.

The above follows from the more general Theorem 5.3.3 in Lai (1996), but well illustrates the use of the Markov chain techniques. Lai's argument consists of proving that for the range of $\alpha$ indicated, the induced Markov chain on $\Theta$ is locally-$\nu$-recurrent so strong admissibility obtains. In fact, the Markov chain techniques developed by Lai to handle this example include extensions of some recurrence criteria of Lamperti (1960) and an application of Theorem 3.3 given above. Although the class of priors in (4.4) is quite small, the extension of Theorem 4.1 to other improper priors can be verified via Remark 3.3 in Eaton (1992). In this particular example, Remark 3.3, coupled with Theorem 4.1, implies the following.

Theorem 4.2. Consider a prior distribution $\nu$ of the form (4.4) with $\alpha \in (-\infty, -\frac{p}{2} + 1]$ and let $g(\theta)$ satisfy

$$c \le g(\theta) \le \frac{1}{c} \quad \text{for all } \theta$$

for some $c > 0$. Then the Markov chain induced by the prior distribution

$$\nu_g(d\theta) = g(\theta)\,\nu(d\theta) \qquad (4.5)$$

is locally-$\nu$-recurrent and the induced posterior distribution is strongly admissible.

For applications of Lai's ideas to the multivariate Poisson case, we refer the reader to Lai's thesis. This completes Example 4.2.

Appendix 

Here we provide a proof of Theorem 3.2. To this end, consider a measurable subset $C \subseteq Y$ that is $\lambda$-proper and let

$$H(C) = \inf_{h \in V(C)} \Delta(h). \qquad (A.1)$$

Also, let

$$V^*(C) = \{h \mid h \in V(C),\ h(y) \in [0, 1] \text{ for } y \in C^c\}.$$

The results in Appendix 2 of Eaton (1992) show that

$$H(C) = \inf_{h \in V^*(C)} \Delta(h). \qquad (A.2)$$

Lemma A.1. Consider measurable subsets $A$ and $B$ of $Y$ that are both $\lambda$-proper. If $A \subseteq B$, then

$$H^{1/2}(A) \le H^{1/2}(B) \le H^{1/2}(A) + 2^{1/2}\lambda^{1/2}(B \cap A^c). \qquad (A.3)$$
Proof. Since $V(A) \supseteq V(B)$, $H(A) \le H(B)$ so the left hand inequality in (A.3) is obvious. For the right hand inequality, first note that $\Delta^{1/2}$ is a subadditive function defined on $L_2(\lambda)$, that is,

$$\Delta^{1/2}(h_1 + h_2) \le \Delta^{1/2}(h_1) + \Delta^{1/2}(h_2). \qquad (A.4)$$

A proof of (A.4) is given below. With $h_3 = h_1 + h_2$, (A.4) yields

$$\Delta^{1/2}(h_3) \le \Delta^{1/2}(h_1) + \Delta^{1/2}(h_3 - h_1) \qquad (A.5)$$

for $h_1$ and $h_3$ in $L_2(\lambda)$. Now consider $h \in V^*(A)$ and write

$$\tilde h(y) = h(y) + g(y)$$

where

$$g(y) = (1 - h(y))\, I_{B \cap A^c}(y).$$

Then $\tilde h \in V^*(B)$ and (A.5) implies that

$$\Delta^{1/2}(\tilde h) \le \Delta^{1/2}(h) + \Delta^{1/2}(g).$$

Thus,

$$H^{1/2}(B) \le \Delta^{1/2}(h) + \Delta^{1/2}(g). \qquad (A.6)$$

Because $g(y) \in [0, 1]$,

$$
\begin{aligned}
\Delta(g) &= \int\!\!\int (g(y) - g(z))^2\, R(dy|z)\,\lambda(dz) \\
&= 2\int g^2(y)\,\lambda(dy) - 2\int\!\!\int g(y)\, g(z)\, R(dy|z)\,\lambda(dz) \\
&\le 2\int g^2(y)\,\lambda(dy) \le 2\lambda(B \cap A^c).
\end{aligned}
$$

Substituting this into (A.6) yields

$$H^{1/2}(B) \le \Delta^{1/2}(h) + 2^{1/2}\lambda^{1/2}(B \cap A^c). \qquad (A.7)$$

Since (A.7) holds for any $h \in V^*(A)$, the right hand inequality in (A.3) holds. This completes the proof. □


The proof of (A.4) follows. For $h_1$ and $h_2$ in $L_2(\lambda)$, consider the symmetric bilinear form

$$\langle h_1, h_2 \rangle = \int h_1(y)\, h_2(y)\,\lambda(dy) - \int\!\!\int h_1(y)\, h_2(z)\, R(dy|z)\,\lambda(dz).$$

That $\langle \cdot, \cdot \rangle$ is non-negative definite is a consequence of the symmetry of $R$ and the Cauchy-Schwarz inequality:

$$
\begin{aligned}
\Big(\int\!\!\int h_1(y)\, h_1(z)\, R(dy|z)\,\lambda(dz)\Big)^2
&\le \Big(\int\!\!\int h_1^2(y)\, R(dy|z)\,\lambda(dz)\Big) \cdot \Big(\int\!\!\int h_1^2(z)\, R(dy|z)\,\lambda(dz)\Big) \\
&= \Big(\int h_1^2(y)\,\lambda(dy)\Big)^2.
\end{aligned}
$$

Thus $\|h\| = \langle h, h \rangle^{1/2}$ is a semi-norm on $L_2(\lambda)$ and so is subadditive. Since $\Delta(h) = 2\|h\|^2$, inequality (A.4) holds.
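
Both facts used here, the identity $\Delta(h) = 2\langle h, h \rangle$ and the subadditivity of $\Delta^{1/2}$, can be checked numerically on a finite state space. The Python sketch below is illustrative only: the five-point state space, the random symmetric weights and the function names are assumptions, not part of the original argument.

```python
import math
import random

random.seed(1)
n = 5  # size of the (hypothetical) finite state space

# Symmetric weights s[y][z] = s[z][y] define lambda and a lambda-symmetric kernel R.
s = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        s[i][j] = s[j][i] = random.random()
lam = [sum(s[y][z] for y in range(n)) for z in range(n)]
R = [[s[y][z] / lam[z] for z in range(n)] for y in range(n)]   # R[y][z] = R({y} | z)

def delta(h):
    """Dirichlet form: sum of (h(y) - h(z))^2 R(y | z) lambda(z)."""
    return sum((h[y] - h[z]) ** 2 * R[y][z] * lam[z]
               for y in range(n) for z in range(n))

def inner(h1, h2):
    """Bilinear form <h1, h2> from the proof of (A.4)."""
    diag = sum(h1[y] * h2[y] * lam[y] for y in range(n))
    cross = sum(h1[y] * h2[z] * R[y][z] * lam[z]
                for y in range(n) for z in range(n))
    return diag - cross

max_identity_gap = 0.0   # largest |Delta(h) - 2 <h, h>| seen
min_slack = math.inf     # smallest value of the subadditivity slack in (A.4)
for _ in range(300):
    h1 = [random.uniform(-2.0, 2.0) for _ in range(n)]
    h2 = [random.uniform(-2.0, 2.0) for _ in range(n)]
    total = [a + b for a, b in zip(h1, h2)]
    max_identity_gap = max(max_identity_gap, abs(delta(h1) - 2.0 * inner(h1, h1)))
    slack = math.sqrt(delta(h1)) + math.sqrt(delta(h2)) - math.sqrt(delta(total))
    min_slack = min(min_slack, slack)
```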

The proof of Theorem 3.2 is now easy. Let $C$ be any $\lambda$-proper set so $\lambda(C) < +\infty$ and let

$$E_i = C_i \cap C, \quad i = 1, 2, \ldots.$$

Since $C_i$ is locally-$\lambda$-recurrent, $H(C_i) = 0$ so $H(E_i) = 0$ by Lemma A.1. Since $E_i \nearrow C$ and $\lambda(C) < +\infty$, we have

$$\lim_{i \to \infty} \lambda(E_i) = \lambda(C)$$

and

$$\lim_{i \to \infty} \lambda(C \cap E_i^c) = 0.$$

Applying (A.3) yields

$$H^{1/2}(C) \le H^{1/2}(E_i) + 2^{1/2}\lambda^{1/2}(C \cap E_i^c).$$

The right hand side of this inequality converges to zero as $i \to \infty$. Hence $H(C) = 0$.

Since $C$ was an arbitrary $\lambda$-proper set, the chain $W$ is locally-$\lambda$-recurrent.
Abstract: In this review of estimation problems in restricted parameter spaces,
we focus through a series of illustrations on a number of methods that have
proven to be successful. These methods relate to the decision-theoretic aspects 
of admissibility and minimaxity, as well as to the determination of dominating 
estimators of inadmissible procedures obtained for instance from the criteria of 
unbiasedness, maximum likelihood, or minimum risk equivariance. Finally, we 
accompany the presentation of these methods with various related historical 
developments. 

1. Introduction 

Herman Rubin has contributed in deep and original ways to statistical theory and 
philosophy. He has selflessly shared his keen intuition into and extensive knowledge 
of mathematics and statistics with many of the researchers represented in this 
volume. The statistical community has been vastly enriched by his contributions 
through his own research and through his influence, direct and indirect, on the
research and thinking of others. We are pleased to join in this celebration in honor 
of Professor Rubin. 

This review paper is concerned with estimation of a parameter or vector parameter $\theta$, when $\theta$ is restricted to lie in some (proper) subset of the "usual" parameter space. The approach is decision theoretic. Hence, we will not be concerned with hypothesis testing problems, or with algorithmic problems of calculating maximum likelihood estimators. Excellent and extensive sources of information on these aspects of restricted inference are given by Robertson, Wright and Dykstra (1988), Akkerboom (1990), and Barlow, Bartholomew, Bremner and Brunk (1972). Nor will we focus on the important topic of interval estimation. Along with the recent review paper by Mandelkern (2002), here is a selection of interesting work concerning methods for confidence intervals, for either interval bounded, lower bounded, or order restricted parameters: Zeytinoglu and Mintz (1984, 1988), Stark (1992), Hwang and Peddada (1994), Drees (1999), Kamboreva and Mintz (1999), Iliopoulos and Kourouklis (2000), and Zhang and Woodroofe (2003).

We will focus mostly on point estimation and we will particularly emphasize finding estimators which dominate classical estimators such as the Maximum Likelihood or UMVU estimator in the unrestricted problem. Issues of minimaxity and admissibility will also naturally arise and be of interest. Suppose, for example, that the problem is a location parameter problem and that the restricted space (and of course the original space) is non-compact. In this case it often happens that these classical estimators are minimax in both the original problem and the restricted problem. If the restriction is to a convex subset, projection of the classical procedure onto the space will typically produce an improved minimax procedure, but the resulting procedure will usually not be admissible because of violation of technical smoothness requirements. In these cases there is a natural interest in finding minimax generalized Bayes estimators. The original result in this setting is that of Katz (1961) who showed (among other things) for the normal location problem with the mean restricted to be non-negative, that the generalized Bayes estimator with respect to the uniform prior (under quadratic loss) is minimax and admissible and dominates the usual (unrestricted ML or UMVU) estimator. Much of what follows has Katz's result as an exemplar. A great deal of the material in sections 2, 3 and 5 is focused on extending aspects of Katz's result.

If, in the above normal example, the restricted space is a compact interval, then 
the projection of the usual estimator still dominates the unrestricted MLE but can- 
not be minimax for quadratic loss because it is not Bayes. In this case Casella and 
Strawderman (1981) and Zinzius (1981) showed that the unique minimax estimator of the mean $\theta$ for a restriction of the form $\theta \in [-m, m]$ is the Bayes estimator corresponding to a 2 point prior on $\{-m, m\}$ for $m$ sufficiently small. The material
in section 6 deals with this result, and the large related literature that has followed. 

In many problems, as in the previous paragraph, Bayes or Generalized Bayes 
estimators are known to form a complete class. When loss is quadratic and the 
prior (and hence typically the posterior) distribution is not degenerate at a point, 
the Bayes estimator cannot take values on the boundary of the parameter space. 
There are many results in the literature that use this phenomenon to determine 
inadmissibility of certain estimators that take values on (or near) the boundary. 
Moors (1985) developed a useful technique which has been employed by a number of 
authors in proving inadmissibility and finding improved estimators. We investigate 
this technique and the related literature in section 4. 

An interesting and important issue to which we will not devote much effort is the 
amount of (relative or absolute) improvement in risk obtained by using procedures 
which take the restrictions on the parameter space into account. In certain situations 
the improvement is substantial. For example, if we know in a normal problem that the variance of the sample mean is 1 and that the population mean $\theta$ is positive, then the risk, at $\theta = 0$, of the (restricted) MLE is 0.5, so there is a 50% savings in risk (at $\theta = 0$). Interestingly, at $\theta = 0$, the risk of the Bayes estimator corresponding to the uniform prior on $(0, \infty)$ is equal to the risk of the unrestricted MLE so there is no savings in risk at $\theta = 0$. There is, however, noticeable improvement some distance from $\theta = 0$. An interesting open problem is to find admissible minimax estimators in this setting which do not have the same risk at $\theta = 0$ as the unrestricted MLE, and,
in particular, to find an admissible minimax estimator dominating the restricted 
MLE. 
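
The two risk values quoted above can be reproduced by numerical integration. The Python sketch below assumes $X \sim N(\theta, 1)$ evaluated at $\theta = 0$, takes the restricted MLE to be $\max(x, 0)$, and uses the closed form $x + \varphi(x)/\Phi(x)$ for the posterior mean under the uniform prior on $(0, \infty)$; the function names and the integration grid are illustrative choices.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function, via erfc for accuracy in the left tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def risk_at_zero(estimator, lo=-10.0, hi=10.0, steps=20000):
    """E[(estimator(X) - 0)^2] for X ~ N(0, 1), by the trapezoid rule."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * estimator(x) ** 2 * phi(x)
    return total * dx

def unrestricted_mle(x):
    return x

def restricted_mle(x):
    return max(x, 0.0)

def bayes_uniform(x):
    """Posterior mean of theta under the uniform prior on (0, infinity)."""
    return x + phi(x) / Phi(x)

risk_unrestricted = risk_at_zero(unrestricted_mle)
risk_restricted = risk_at_zero(restricted_mle)
risk_bayes = risk_at_zero(bayes_uniform)
```

Under these assumptions the restricted MLE has risk close to 0.5 at the boundary, while the generalized Bayes estimator matches the unrestricted risk of 1 there.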

We will concern ourselves primarily with methods that have proven to be suc- 
cessful in such problems, and somewhat less so with cataloguing the vast collection 
of results that have appeared in the literature. In particular, we will concentrate 
on the following methods. 

In Section 2, we describe a recent result of Hartigan (2003). He shows, if $X \sim N_p(\theta, I_p)$, loss is $L(\theta, d) = \|d - \theta\|^2$, and $\theta \in C$, where $C$ is any convex set (with non-empty interior), then the Bayes estimator with respect to the uniform prior distribution on $C$ dominates the (unrestricted MRE, UMVU, unrestricted ML) estimator $\delta_0(X) = X$. Hartigan's result adds a great deal to what was already known and provides a clever new technique for demonstrating domination.

In Section 3, we study the Integral Expression of Risk Difference (IERD) method introduced by Kubokawa (1994a). The method is quite general as regards loss function and underlying distribution. It has proven useful in unrestricted as well as restricted parameter spaces. In particular, one of its first uses was to produce an estimator dominating the James-Stein estimator of a multivariate normal mean under squared error loss.

In Section 4, following a discussion on questions of admissibility concerning estimators that take values on the boundary of a restricted parameter space, we investigate a technique of Moors (1985) which is useful in constructing improvements to such estimators under squared error loss.

Section 5 deals with estimating parameters in the presence of additional information. For example, suppose $X_1$ and $X_2$ are multivariate normal variates with unknown means $\theta_1$ and $\theta_2$, and known covariance matrices $\sigma_1^2 I$ and $\sigma_2^2 I$. We wish to estimate $\theta_1$ with squared error loss $\|\delta - \theta_1\|^2$ when we know for example that $\theta_1 - \theta_2 \in A$ for some set $A$. We illustrate the application of a rotation technique, used perhaps first by Blumenthal and Cohen (1968a), as well as Cohen and Sackrowitz (1970), which, loosely described, permits subdividing the estimation problem into parts that can be separately handled.

Section 6 deals with minimaxity, and particularly those results related to Casella and Strawderman (1981) and Zinzius (1981) establishing minimaxity of Bayes estimators relative to 2 point priors on the boundary of a sufficiently small one dimensional parameter space of the form $[a, b]$.

2. Hartigan’s result 

Let $X \sim N_p(\theta, I_p)$, $\theta \in C$ where $C$ is an arbitrary convex set in $R^p$ with a non-empty interior. For estimating $\theta$ under squared error loss, Hartigan (2003) recently proved the striking result that the (Generalized) Bayes estimator relative to the uniform prior distribution on $C$ dominates the usual (unrestricted) MRE estimator $X$. It seems quite fitting to begin our review of methods useful in restricted parameter spaces by discussing this, the newest of available techniques. Below, $\nabla$ and $\nabla^2$ will denote respectively the gradient and Laplacian operators.

Theorem 1 (Hartigan, 2003). For $X \sim N_p(\theta, I_p)$, $\theta \in C$ with $C$ being a convex subset of $R^p$ with a non-empty interior, the Bayes estimator $\delta_U(X)$ with respect to a uniform prior on $C$ dominates $\delta_0(X) = X$ under squared error loss $\|d - \theta\|^2$.

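Proof. Writing

$$\delta_U(X) = X + \frac{\nabla_X m(X)}{m(X)} \quad \text{with} \quad m(X) = (2\pi)^{-p/2} \int_C e^{-\frac{1}{2}\|X - \nu\|^2}\, d\nu,$$

we have following Stein (1981),

$$
\begin{aligned}
R(\theta, \delta_U(X)) - R(\theta, \delta_0(X))
&= E_\theta\left[\frac{\|\nabla_X m(X)\|^2}{m^2(X)} + \frac{m(X)\nabla_X^2 m(X) - \|\nabla_X m(X)\|^2}{m^2(X)} + \frac{(X - \theta)'\nabla_X m(X)}{m(X)}\right] \\
&= E_\theta\left[\frac{1}{m(X)}\, H(X, \theta)\right]
\end{aligned}
$$

where $H(x, \theta) = \nabla_x^2 m(x) + (x - \theta)'\nabla_x m(x)$.

It suffices to show that $H(x, \theta) \le 0$ for all $x \in R^p$ and $\theta \in C$. Now, observe that $\nabla_x\big(e^{-\frac{1}{2}\|x - \nu\|^2}\big) = -\nabla_\nu\big(e^{-\frac{1}{2}\|x - \nu\|^2}\big)$ and $\nabla_x^2\big(e^{-\frac{1}{2}\|x - \nu\|^2}\big) = \nabla_\nu^2\big(e^{-\frac{1}{2}\|x - \nu\|^2}\big)$ so that

$$
\begin{aligned}
(2\pi)^{p/2} H(x, \theta) &= \nabla_x^2 \int_C e^{-\frac{1}{2}\|x - \nu\|^2}\, d\nu + (x - \theta)'\nabla_x \int_C e^{-\frac{1}{2}\|x - \nu\|^2}\, d\nu \\
&= \int_C \big[\nabla_\nu^2 - (x - \theta)'\nabla_\nu\big]\, e^{-\frac{1}{2}\|x - \nu\|^2}\, d\nu \\
&= \int_C \mathrm{div}_\nu\big[(\theta - \nu)\, e^{-\frac{1}{2}\|x - \nu\|^2}\big]\, d\nu. \qquad (1)
\end{aligned}
$$

By the Divergence theorem, this last expression gives us

$$(2\pi)^{p/2} H(x, \theta) = \int_{\partial C} \eta(\nu)'(\theta - \nu)\, e^{-\frac{1}{2}\|x - \nu\|^2}\, d\sigma(\nu), \qquad (2)$$

where $\eta(\nu)$ is the unit outward normal to $C$ at $\nu$ on $\partial C$, and $d\sigma(\nu)$ is the surface area Lebesgue measure on $\partial C$ (for $p = 1$, see Example 1). Finally, since $C$ is convex, the angle between the directions $\eta(\nu)$ and $\theta - \nu$ for a boundary point $\nu$ is obtuse, and we thus have $\eta(\nu)'(\theta - \nu) \le 0$, for $\theta \in C$, $\nu \in \partial C$, yielding the result. □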
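
For $p = 1$ and $C = [a, b]$ the quantities in the proof have closed forms, which allows a quick numerical check. The Python sketch below is an illustration under that assumption: $m(x) = \Phi(b - x) - \Phi(a - x)$, so $m'$ and $m''$ are expressed through the normal density $\varphi$, and the boundary formula (2) reduces to the two endpoint terms. The interval $[-1, 2]$ and the function names are hypothetical choices.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def H_from_m(x, theta, a, b):
    """H(x, theta) = m''(x) + (x - theta) m'(x) for m(x) = Phi(b - x) - Phi(a - x)."""
    m1 = phi(a - x) - phi(b - x)                        # m'(x)
    m2 = (a - x) * phi(a - x) - (b - x) * phi(b - x)    # m''(x)
    return m2 + (x - theta) * m1

def H_boundary(x, theta, a, b):
    """Boundary form (2) for p = 1: outward normals are +1 at b and -1 at a."""
    return (theta - b) * phi(x - b) - (theta - a) * phi(x - a)

a, b = -1.0, 2.0
xs = [-6.0 + 0.25 * k for k in range(49)]
thetas = [a + (b - a) * k / 12.0 for k in range(13)]

max_gap = max(abs(H_from_m(x, th, a, b) - H_boundary(x, th, a, b))
              for x in xs for th in thetas)
max_H = max(H_from_m(x, th, a, b) for x in xs for th in thetas)
```

On this grid the two expressions agree to rounding error and $H$ is never positive, as the theorem requires.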


Remark 1.

(a) If $\theta$ belongs to the interior $C^\circ$ of $C$ (as in part (a) of Example 1), notice that $\eta(\nu)'(\theta - \nu) < 0$ a.e. $\sigma(\nu)$, which implies $H(x, \theta) < 0$ for $\theta \in C^\circ$ and $x \in R^p$, and consequently $R(\theta, \delta_U(X)) < R(\theta, \delta_0(X))$ for $\theta \in C^\circ$.

(b) On the other hand, if $C$ is a pointed cone at $\theta_0$ (as in part (b) of Example 1), then $\eta(\nu)'(\theta_0 - \nu) = 0$ for all $\nu \in \partial C$ which implies $R(\theta_0, \delta_U(X)) = R(\theta_0, \delta_0(X))$.

As we describe below, Theorem 1 has previously been established for various specific parameter spaces $C$. However, Hartigan's result offers not only a unified and elegant proof, but also gives many non-trivial extensions with respect to the parameter space $C$. We continue with the instructive illustration of a univariate restricted normal mean.


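Example 1.

(a) For $X \sim N(\theta, 1)$ with $\theta \in C = [a, b]$, we have by (1),

$$(2\pi)^{1/2} H(x, \theta) = \int_a^b \frac{d}{d\nu}\big[(\theta - \nu)\, e^{-\frac{1}{2}(x - \nu)^2}\big]\, d\nu = (\theta - b)\, e^{-\frac{1}{2}(x - b)^2} - (\theta - a)\, e^{-\frac{1}{2}(x - a)^2} < 0, \quad \text{for all } \theta \in [a, b].$$

This tells us that $R(\theta, \delta_U(X)) < R(\theta, \delta_0(X))$ for all $\theta \in C = [a, b]$.

(b) For $X \sim N(\theta, 1)$ with $\theta \in C = (a, \infty)$ (or $C = (-\infty, a)$), it is easy to see that the development in part (a) remains valid with the exception that $H(x, a) = 0$ for all $x \in R$, which tells us that $R(\theta, \delta_U(X)) \le R(\theta, \delta_0(X))$ for $\theta \in C$ with equality iff $\theta = a$.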

The dominance result for the bounded normal mean in Example 1(a) was established by MacGibbon, Gatsonis and Strawderman (1987), in a different fashion, by means of Stein's unbiased estimate of the difference in risks, and sign change arguments following Karlin (1957). The dominance result for the lower bounded normal mean in Example 1(b) was established by Katz (1961), where he also showed that $\delta_U(X)$ is a minimax and admissible estimator of $\theta$.* Notice that these results by themselves lead to extensions of the parameter spaces $C$ where $\delta_U(X)$ dominates $\delta_0(X)$, for instance to hyperrectangles of the form $C = \{\theta \in R^p : \theta_i \in [a_i, b_i];\ i = 1, \ldots, p\}$, and to intersections of half-spaces since such problems can be expressed as "products" of one-dimensional problems.

Balls and cones in $R^p$ are two particularly interesting classes of convex sets for which Hartigan's result gives new and useful information. It is known that for balls of sufficiently small radius (see e.g., Marchand and Perron, 2001, and Section 4.3 below), the uniform prior leads to procedures dominating the mle, but Hartigan's result implies that the uniform Bayes procedures always dominate $\delta_0(X) = X$. Also, for certain types of cones such as intersections of half spaces, Katz's result implies domination over $X$ as previously mentioned. However, Hartigan's result applies to all cones, and, again, increases greatly the catalog of problems where the uniform Bayes procedure dominates $X$ under squared error loss.

Now, Hartigan's result, as described above, although very general with respect to the choice of the convex parameter space $C$, is nevertheless limited to: (i) normal models, (ii) squared error loss, (iii) the uniform prior as leading to a dominating Bayes estimator; and extensions in these three directions are certainly of interest. Extensions to general univariate location families and general location invariant losses are discussed in Section 3.2. Finally, it is worth pointing out that in the context of Theorem 1, the maximum likelihood estimator $\delta_{mle}(X)$, which is the projection of $X$ onto $C$, also dominates $\delta_0(X) = X$. Hence, dominating estimators of $\delta_0(X)$ can be generated by convex linear combinations of $\delta_U(X)$ and $\delta_{mle}(X)$. Thus the inadmissibility itself is obvious but the technique and the generality are very original and new.

3. Kubokawa’s method 

Kubokawa (1994a) introduced a powerful method, based on an integral expression of risk difference (IERD), to give a unified treatment of point and interval estimation of the powers of a scale parameter, including the particular case of the estimation of a normal variance. He also applied his method for deriving a class of estimators improving on the James-Stein estimator of a multivariate mean. As reviewed by Kubokawa (1998, 1999), many other applications followed such as in: estimation of variance components, estimation of non-centrality parameters, linear calibration, estimation of the ratio of scale parameters, estimation of location and scale parameters under order restrictions, and estimation of restricted location and scale parameters. As well, a particular strength resides in the flexibility of the method in handling various loss functions.

*Although the result is correct, the proof given by Katz has an error (see for instance van Eeden, 1995).


S. 1 . Example 

Here is an illustration of Kubokawa's IERD method for an observable $X$ generated
from a location family density $f_\theta(x) = f_0(x - \theta)$, with known $f_0$, where $E_0[X] = 0$,
and $E_0[X^2] < \infty$. For estimating $\theta$, with squared error loss $(d - \theta)^2$, under the
constraint $\theta \ge a$ (hereafter, we will take $a = 0$ without loss of generality), we show
that the Generalized Bayes estimator $\delta_U(X)$ with respect to the uniform prior
$\pi(\theta) = 1_{(0,\infty)}(\theta)$ dominates the MRE estimator $\delta_0(X) = X$. As a preliminary to
the following dominance result, observe that $\delta_U(X) = X + h_U(X)$, where

$$h_U(y) = \frac{\int_0^{\infty} (\theta - y) f_0(y - \theta) \, d\theta}{\int_0^{\infty} f_0(y - \theta) \, d\theta} = \frac{-\int_{-\infty}^{y} u f_0(u) \, du}{\int_{-\infty}^{y} f_0(u) \, du} = -E_0[X \mid X \le y];$$

and that $h_U$ is clearly continuous, nonincreasing, with $h_U(y) \ge -\lim_{y \to \infty} E_0[X \mid X \le y] = -E_0[X] = 0$.
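As a rough numerical illustration of the identity $h_U(y) = -E_0[X \mid X \le y]$, the following Python sketch evaluates $h_U(y)$ directly from its definition as a ratio of integrals over $\theta \ge 0$ and compares it with the conditional-mean form. It assumes a standard normal $f_0$, for which $-E_0[X \mid X \le y]$ equals the ratio of the normal density to the normal cdf at $y$; the function names, the truncation of the $\theta$-integral and the Simpson rule are illustrative choices only.

```python
import math

def normal_pdf(u):
    """Standard normal density, an assumed choice of f_0 for this sketch."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def normal_cdf(u):
    """Standard normal cdf; erfc keeps accuracy in the left tail."""
    return 0.5 * math.erfc(-u / math.sqrt(2.0))

def simpson(func, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    step = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * step)
    return total * step / 3.0

def h_U_from_definition(y, f0=normal_pdf, tail=12.0):
    """Ratio of the two integrals over theta >= 0, truncated where f0 is negligible."""
    upper = max(y, 0.0) + tail
    num = simpson(lambda t: (t - y) * f0(y - t), 0.0, upper)
    den = simpson(lambda t: f0(y - t), 0.0, upper)
    return num / den

def h_U_closed_form(y):
    """-E_0[X | X <= y] for the standard normal, i.e. pdf(y) / cdf(y)."""
    return normal_pdf(y) / normal_cdf(y)

if __name__ == "__main__":
    for y in (-2.0, 0.0, 1.5):
        print(y, h_U_from_definition(y), h_U_closed_form(y))
```

Under these assumptions the two columns should agree up to quadrature error, and the values should decrease in $y$ while staying positive, in line with the stated properties of $h_U$.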

Theorem 2. For the restricted parameter space $\theta \in \Theta = [0, \infty)$, and under squared
error loss:

(a) Estimators $\delta_h(X) = \delta_0(X) + h(X)$ with absolutely continuous, non-negative,
nonincreasing $h$, dominate $\delta_0(X) = X$ whenever $h(x) \le h_U(x)$ (and $\delta_h \ne \delta_0$);

(b) The Generalized Bayes estimator $\delta_U(X)$ dominates the MRE estimator $\delta_0(X)$.


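Proof. First, part (b) follows from part (a) and the above mentioned properties
of $h_U$. Observing that the properties of $h$ and $h_U$ imply $\lim_{y \to \infty} h(y) = 0$, and following
Kubokawa (1994a), we have

$$(x - \theta)^2 - (x + h(x) - \theta)^2 = (x + h(y) - \theta)^2 \Big|_{y=x}^{\infty} = \int_x^{\infty} \frac{\partial}{\partial y} (x + h(y) - \theta)^2 \, dy = 2 \int_x^{\infty} h'(y) (x + h(y) - \theta) \, dy,$$

so that

$$\begin{aligned}
\Delta_h(\theta) &= E_\theta\left[(X - \theta)^2 - (X + h(X) - \theta)^2\right] \\
&= 2 \int_{-\infty}^{\infty} \left\{ \int_x^{\infty} h'(y) (x + h(y) - \theta) \, dy \right\} f_0(x - \theta) \, dx \\
&= 2 \int_{-\infty}^{\infty} h'(y) \left\{ \int_{-\infty}^{y} (x + h(y) - \theta) f_0(x - \theta) \, dx \right\} dy.
\end{aligned}$$

Now, since $h'(y) \le 0$ ($h'$ exists a.e.), it suffices in order to prove that $\Delta_h(\theta) \ge 0$;
$\theta \ge 0$; to show that

$$G_h(y, \theta) = \int_{-\infty}^{y} (x + h(y) - \theta) f_0(x - \theta) \, dx \le 0$$

for all $y$, and $\theta \ge 0$. But, this is equivalent to

$$\begin{aligned}
& \frac{\int_{-\infty}^{y} (x + h(y) - \theta) f_0(x - \theta) \, dx}{\int_{-\infty}^{y} f_0(x - \theta) \, dx} \le 0 \\
\Leftrightarrow\; & \frac{\int_{-\infty}^{y - \theta} (u + h(y)) f_0(u) \, du}{\int_{-\infty}^{y - \theta} f_0(u) \, du} \le 0 \\
\Leftrightarrow\; & h(y) \le -E_0[X \mid X \le y - \theta]; \text{ for all } y, \text{ and } \theta \ge 0; \\
\Leftrightarrow\; & h(y) \le \inf_{\theta \ge 0} \{-E_0[X \mid X \le y - \theta]\}; \text{ for all } y; \\
\Leftrightarrow\; & h(y) \le -E_0[X \mid X \le y] = h_U(y); \text{ for all } y;
\end{aligned}$$

given that $E_0[X \mid X \le z]$ is nondecreasing in $z$. This establishes part (a), and
completes the proof of the theorem. □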
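The dominance asserted in Theorem 2 can be checked numerically in a simple case. The sketch below assumes $X \sim N(\theta, 1)$ with $\theta \ge 0$, so that $h_U(y)$ is the ratio of the standard normal density to the standard normal cdf, and approximates the risk $E_\theta[(X + h(X) - \theta)^2]$ by Simpson's rule for three shifts: $h = 0$ (the MRE estimator), $h = h_U$, and $h(y) = \max(0, -y)$ (the truncated estimator $\max(0, X)$). All names, the integration range and the grid size are hypothetical choices for illustration.

```python
import math

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def normal_cdf(u):
    # erfc keeps accuracy in the far left tail
    return 0.5 * math.erfc(-u / math.sqrt(2.0))

def h_zero(y):
    """Shift of the MRE estimator delta_0(X) = X."""
    return 0.0

def h_U(y):
    """Shift of the uniform-prior generalized Bayes estimator (normal model)."""
    return normal_pdf(y) / normal_cdf(y)

def h_plus(y):
    """Shift of the truncated estimator max(0, X) = X + max(0, -X)."""
    return max(0.0, -y)

def risk(h, theta, half_width=10.0, n=4000):
    """Simpson approximation of E_theta[(X + h(X) - theta)^2] for X ~ N(theta, 1)."""
    a = theta - half_width
    step = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        x = a + i * step
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * (x + h(x) - theta) ** 2 * normal_pdf(x - theta)
    return total * step / 3.0

if __name__ == "__main__":
    for theta in (0.0, 0.5, 1.0, 2.0):
        print(theta, risk(h_zero, theta), risk(h_U, theta), risk(h_plus, theta))
```

In this setting the risk of $X$ is constant and equal to 1, the risk of the estimator with shift $h_U$ should equal 1 at $\theta = 0$ and fall below 1 for $\theta > 0$, and the truncated estimator should have risk $1/2$ at $\theta = 0$.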

Remark 2. In Theorem 2, it is worth pointing out, and it follows immediately that
$G_h(y, \theta) \le 0$, for all $y$, with equality iff $h = h_U$ and $\theta = 0$, which indicates that, for
the dominating estimators of Theorem 2, $R(\theta, \delta_h(X)) \le R(\theta, \delta_0(X))$ with equality
iff $h = h_U$ and $\theta = 0$. As a consequence, $\delta_U(X)$ fails to dominate any of these
other dominating estimators $\delta_h(X)$, and this includes the case of the truncation of
$\delta_0(X)$ onto $[0, \infty)$, $\delta^+(X) = \max(0, \delta_0(X))$ (also see Section 4.4 for a discussion on
a normal model $\delta^+(X)$).

3.2. Some related results to Theorem 2 

For general location family densities $f_0(x - \theta)$, and invariant loss $L(\theta, d) = \rho(d - \theta)$
with strictly convex $\rho$, Farrell (1964) established: (i) part (b) of Theorem 2, and
(ii) the minimaxity of $\delta_U(X)$, and (iii) the admissibility of $\delta_U(X)$ for squared error
loss $\rho$. Using Kubokawa's method, Marchand and Strawderman (2003a) establish
extensions of Theorem 2 (and of Farrell's result (i)) to strictly bowl-shaped losses
$\rho$. They also show, for quite general $(f_0, \rho)$, that the constant risk of the MRE
estimator $\delta_0(X)$ matches the minimax risk. This implies that dominating estimators
of $\delta_0(X)$, such as those in extensions of Theorem 2, which include $\delta_U(X)$ and $\delta^+(X)$,
are necessarily minimax for the restricted parameter space $\Theta = [0, \infty)$. Marchand
and Strawderman (2003a, b) give similar developments for scale families, and for
cases where the restriction on $\theta$ is to an interval $[a, b]$. Related work for various
models and losses includes Jozani, Nematollahi and Shafie (2002), van Eeden (2000,
1995), Parsian and Sanjari (1997), Gajek and Kaluszka (1995), Berry (1993), and
Gupta and Rohatgi (1980), and many of the references contained in these papers.

Finally, as previously mentioned, Kubokawa's method has been applied to a wide
range of problems, but, in particular for problems with ordered scale or location
parameters (also see Remark 4), results and proofs similar to Theorem 2 have been
established by Kubokawa and Saleh (1994), Kubokawa (1994b), and Iliopoulos
(2000).

4. Estimators that take values on the boundary of the parameter space 

Theoretical difficulties that arise in situations when estimating procedures take
values on, or close to the boundary of constrained parameter spaces are well
documented. For instance, Sacks (1963), and Brown (1971), show for estimating under
squared error loss a lower bounded normal mean $\theta$ with known variance, that the
maximum likelihood estimator is an inadmissible estimator of $\theta$. More recently,
difficulties such as those encountered with interval estimates have been addressed
in Mandelkern (2002). In this section, we briefly expand on questions of
admissibility and on searches for improved procedures, but we mostly focus on a method
put forth by Moors (1985) which is useful in providing explicit improvements of
estimators that take values on, or close to the boundary of a restricted parameter
space.

4.1. Questions of admissibility

Here is a simple example which illustrates why, in many cases, estimators that take
values on the boundary of the parameter space are inadmissible under squared error
loss. Take $X \sim N_p(\theta, I_p)$ where $\theta$ is restricted to a ball $\Theta(m) = \{\theta \in \Re^p : \|\theta\| \le m\}$.
Complete class results indicate that admissible estimators are necessarily Bayes for
some prior $\pi$ (supported on $\Theta(m)$, or a subset of $\Theta(m)$). Now observe that prior
and posterior pairs $(\pi, \pi|x)$ must be supported on the same set, and that a Bayes
estimator takes values $\delta_\pi(x) = E(\theta|x)$ on the interior of the convex $\Theta(m)$, as
long as $\pi|x$, and hence $\pi$, is not degenerate at some boundary point $\theta_0$ of $\Theta(m)$.
The conclusion is that non-degenerate estimators $\delta(X)$ which take values on the
boundary of $\Theta(m)$ (i.e., $\mu\{x : \delta(x) \in \partial(\Theta(m))\} > 0$, with $\mu$ as Lebesgue measure);
which includes for instance the MLE; are inadmissible under squared error loss. In
a series of papers, Charras and van Eeden (1991a, 1991b, 1992, 1994) formalize the
above argument for more general models, and also provide various results concerning
the admissibility and Bayesianity under squared error loss of boundary estimators
in convex parameter spaces. Useful sources of general complete class results, that
apply for bounded parameter spaces, are the books of Berger (1985), and Brown
(1986).

Remark 3. As an example where the prior and posterior do not always have the
same support, and where the above argument does not apply, take $X \sim Bi(n, \theta)$
with $\theta \in [0, m]$. Moreover, consider the MLE which takes values on the boundary
of $[0, m]$. It is well known that the MLE is admissible (under squared error loss) for
$m = 1$. Interestingly, again for squared error loss, Charras and van Eeden (1991a)
establish the admissibility of the MLE for cases where $m \le 2/n$, while Funo (1991)
establishes its inadmissibility for cases where $m < 1$ and $m > 2/n$. Interestingly
and in contrast to squared error loss, Bayes estimators under absolute-value loss
may well take values on the boundary of the parameter space. For instance, Iwasa
and Moritani (1997) show, for a normal mean bounded to an interval $[a, b]$ (known
standard deviation), that the MLE is a proper Bayes (and admissible) estimator
under absolute-value loss.

The method of Moors, described in detail in Moors (1985), and further
illustrated by Moors (1981) and Moors and van Houwelingen (1987), permits the
construction of improved estimators under squared error loss of invariant estimators
that take values on, or too close to the boundary of closed and convex parameter
spaces. We next pursue with an illustration of this method.

4.2. The method of Moors

Illustrating Moors' method, suppose an observable $X$ is generated from a location
family density $f(x - \theta)$ with known positive and symmetric $f$. For estimating $\theta \in \Theta = [-m, m]$
with squared error loss, consider invariant estimators (with respect
to sign changes) which are of the form

$$\delta_g(X) = g(|X|) \frac{X}{|X|}.$$

The objective is to specify dominating estimators of $\delta_g(X)$, for cases where
$\delta_g(X)$ takes values on or near the boundary $\{-m, m\}$ (i.e., $|m - g(x)|$ is "small"
for some $x$).

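Decompose the risk of $\delta_g(X)$ by conditioning on $|X|$ (i.e., the maximal invariant)
to obtain (below, the notation $E_\theta^{|X|}$ represents the expectation with respect to $|X|$)

$$\begin{aligned}
R(\theta, \delta_g(X)) &= E_\theta^{|X|} \left\{ E_\theta\left[ \left( g(|X|) \frac{X}{|X|} - \theta \right)^2 \,\Big|\, |X| \right] \right\} \\
&= E_\theta^{|X|} \left\{ \theta^2 + g^2(|X|) - 2 E_\theta\left[ \theta \, g(|X|) \frac{X}{|X|} \,\Big|\, |X| \right] \right\} \\
&= E_\theta^{|X|} \left\{ \theta^2 + g^2(|X|) - 2 g(|X|) A_{|X|}(\theta) \right\},
\end{aligned}$$

where

$$A_{|X|}(\theta) = \theta \, E_\theta\left[ \frac{X}{|X|} \,\Big|\, |X| \right] = \theta \, \frac{f(|X| - \theta) - f(|X| + \theta)}{f(|X| - \theta) + f(|X| + \theta)}$$

(as in (6) below) by symmetry of $f$. Now, rewrite the risk as

$$R(\theta, \delta_g(X)) = E_\theta^{|X|}\left[ \theta^2 - A_{|X|}^2(\theta) \right] + E_\theta^{|X|}\left[ \left( g(|X|) - A_{|X|}(\theta) \right)^2 \right], \qquad (3)$$

to isolate with the second term the role of $g$, and to reflect the fact that the
performance of the estimator $\delta_g(X)$, for $\theta \in [-m, m]$, is measured by the average distance
$(g(|X|) - A_{|X|}(\theta))^2$ under $f(x - \theta)$. Continue by defining the set $\bar{A}_{|x|}$ as the convex
hull of the set $\{A_{|x|}(\theta) : -m \le \theta \le m\}$. Coupled with the prior representation (3)
of $R(\theta, \delta_g(X))$, we can now state the following result.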

Theorem 3. Suppose $\delta_g(X)$ is an estimator such that $\mu\{x : g(|x|) \notin \bar{A}_{|x|}\} > 0$;
then the estimator $\delta_{g_0}(X)$ with $g_0(|x|)$ being the projection of $g(|x|)$
onto $\bar{A}_{|x|}$ dominates $\delta_g(X)$, with squared error loss under $f$, for $\theta \in \Theta = [-m, m]$.

Example 2. (Normal Case) Consider a normal model $f$ with variance 1. We obtain
$A_{|x|}(\theta) = \theta \tanh(\theta |x|)$, and $\bar{A}_{|x|} = [0, m \tanh(m|x|)]$, since $A_{|x|}(\theta)$ is increasing
in $|\theta|$. Consider an estimator $\delta_g(X)$ such that $\mu\{x : g(|x|) > m \tanh(m|x|)\} > 0$.
Theorem 3 tells us that $\delta_{g_0}(X)$, with $g_0(|X|) = \min(m \tanh(m|X|), g(|X|))$,
dominates $\delta_g(X)$.
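The projection step of Example 2 is easy to try out numerically. The following Python sketch assumes $X \sim N(\theta, 1)$ with $|\theta| \le m$ and takes $m = 1$; it compares the risk of the truncated estimator with $g(|x|) = \min(|x|, m)$ against the risk of its projection $g_0(|x|) = \min(m \tanh(m|x|), g(|x|))$, using Simpson's rule. The names `g_mle`, `g_upper`, `g_proj` and `risk` are hypothetical helpers introduced only for this illustration.

```python
import math

M = 1.0  # assumed bound on |theta| for this illustration

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def g_mle(t):
    """g for the truncated estimator: min(|x|, m)."""
    return min(t, M)

def g_upper(t):
    """Upper end of the interval [0, m tanh(m |x|)] from Example 2."""
    return M * math.tanh(M * t)

def g_proj(t):
    """Projection of g_mle onto [0, m tanh(m |x|)] (g_mle is never negative)."""
    return min(g_upper(t), g_mle(t))

def risk(g, theta, half_width=10.0, n=4000):
    """Simpson approximation of E_theta[(sign(X) g(|X|) - theta)^2], X ~ N(theta, 1)."""
    a = theta - half_width
    step = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        x = a + i * step
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        estimate = math.copysign(g(abs(x)), x)
        total += weight * (estimate - theta) ** 2 * normal_pdf(x - theta)
    return total * step / 3.0

if __name__ == "__main__":
    for theta in (0.0, 0.5, 1.0):
        print(theta, risk(g_mle, theta), risk(g_proj, theta))
```

With $m = 1$ the projected function coincides with $m \tanh(m|x|)$ everywhere, since $\tanh(t) \le \min(t, 1)$ for $t \ge 0$, so the second risk column is also the risk of the estimator $x \mapsto m \tanh(mx)$.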

Here are some additional observations related to the previous example (but also
applicable to the general case of this section).

(i) The set $\bar{A}_{|x|} = [0, m \tanh(m|x|)]$ can be interpreted as yielding a complete
class of invariant estimators with the upper envelope corresponding to the
Bayes estimator $\delta_{BU}(X)$ associated with the uniform prior on $\{-m, m\}$.

(ii) In Example 2, the dominating estimator $\delta_{g_0}(X)$ of Theorem 3 will be given
by the Bayes estimator $\delta_{BU}(X)$ if and only if $m \tanh(m|x|) \le g(|x|)$, for all
$x$. In particular, if $\delta_g(X) = \delta_{mle}(X)$, with $g(|X|) = \min(|X|, m)$, it is easy to
verify that $\delta_{g_0}(X) = \delta_{BU}(X)$ iff $m \le 1$.

(iii) It is easy to see that improved estimators $\delta_{g'}(X)$ of $\delta_g(X)$ can be constructed
by projecting $g(|x|)$ a little bit further onto the interior of $\bar{A}_{|x|}$, namely by
selecting $g'$ such that $\frac{1}{2}[g'(|x|) + g(|x|)] \ge g_0(|x|)$ whenever $g(|x|) > g_0(|x|)$.
4.3. Some related work

Interestingly, the dominance result in (iii) of the normal model MLE was previously
established, in a different manner, by Casella and Strawderman (1981) (see also
Section 6). As well, other dominating estimators here were provided numerically by
Kempthorne (1988).

For the multivariate version of Example 2: $X \sim N_p(\theta, I_p)$; ($p \ge 1$); with
$\|\theta\| \le m$, Marchand and Perron (2001) give dominating estimators of $\delta_{mle}(X)$ under
squared error loss $\|d - \theta\|^2$. Namely, using a similar risk decomposition as above,
including argument (ii), they show that $\delta_{BU}(X)$ (Bayes with respect to a boundary
uniform prior) dominates $\delta_{mle}(X)$ whenever $m \le \sqrt{p}$. By pursuing with additional
risk decompositions, they obtain various other dominance results. In particular, it
is shown that, for sufficiently small radius $m$, $\delta_{mle}(X)$ is dominated by all Bayesian
estimators associated with orthogonally invariant priors (which includes the uniform
Bayes estimator $\delta_U$). Finally, Marchand and Perron (2003) give extensions
and robustness results involving $\delta_{BU}$ to spherically symmetric models, and Perron
(2003) gives a similar treatment for the model $X \sim Bi(n, \theta)$ with $|\theta - \frac{1}{2}| \le m$.

4.4. Additional topics and the case of a positive normal mean

Other methods have proven useful in assessing the performance of boundary
estimators in constrained parameter spaces, as well as providing improvements. As an
example, for the model $X_i \sim Bin(n_i, \theta_i)$; $i = 1, \ldots, k$; with $\theta_1 \le \theta_2 \le \cdots \le \theta_k$,
Sackrowitz and Strawderman (1974) investigated the admissibility (for various loss
functions) of the MLE of $(\theta_1, \ldots, \theta_k)$, while Sackrowitz (1982) provided
improvements (under sum of squared error losses) to the MLE in the cases above where it
is inadmissible. Further examples consist of a series of papers by Shao and
Strawderman (1994, 1996a, 1996b) where, in various models, improvements under squared
error loss to truncated estimators are obtained. Further related historical
developments are given in the review paper of van Eeden (1996).

We conclude this section by expanding upon the case of a positive (or
lower-bounded) normal mean $\theta$, for $X \sim N(\theta, 1)$, $\theta \ge 0$. While a plausible and natural
estimator is given by the MLE $\max(0, X)$, its efficiency requires examination
perhaps because it discards part of the sufficient statistic $X$ (i.e., the MLE gives a
constant estimate on the region $X \le 0$). Moreover, as previously mentioned, the
MLE has long been known to be inadmissible (e.g., Sacks, 1963) under squared
error loss. Despite the age of this finding, it was not until the paper of Shao and
Strawderman (1996a) that explicit improvements were obtained (under squared
error loss), and there still remains the open question of finding admissible
improvements. As well, Katz's (1961) uniform Generalized Bayes estimator remains (to our
knowledge) the only known minimax and admissible estimator of $\theta$ (under squared
error loss).

5. Estimating parameters with additional information 

In this section, we present a class of interesting estimation problems which can
be transformed to capitalize on standard solutions for estimation problems in
constrained parameter spaces. The key technical aspect of subdividing the estimation
problem into distinct pieces that can be handled separately is perhaps due to the
early work of Blumenthal, Cohen and Sackrowitz. As well, these types of problems
have been addressed in some recent work of Constance van Eeden and Jim Zidek.

Suppose $X_j$; $j = 1, 2$; are independently distributed as $N_p(\theta_j, \sigma_j^2 I_p)$, with $p \ge 1$
and known $\sigma_1$, $\sigma_2$. Consider estimating $\theta_1$ under squared error loss
$L(\theta_1, d) = \|d - \theta_1\|^2$ with the prior information $\theta_1 - \theta_2 \in A$; $A$ being a proper subset of
$\Re^p$. For instance, with order restrictions of the form $\theta_{1,i} \ge \theta_{2,i}$, $i = 1, \ldots, p$, we would
have $A = [0, \infty)^p$. Heuristics suggest that the independently distributed $X_2$ can be
used in conjunction with the information $\theta_1 - \theta_2 \in A$ to construct estimators that
improve upon the unrestricted MLE (and UMVU estimator) $\delta_0(X_1, X_2) = X_1$. For
instance, suppose $\sigma_2^2 \approx 0$, and that $A$ is convex. Then, arguably, estimators of
$\theta_1$ should shrink towards $A + X_2 = \{\theta_1 : \theta_1 - X_2 \in A\}$. The recognition of this
possibility (for $p = 1$ and $A = (0, \infty)$) goes back at least as far as Blumenthal and
Cohen (1968a), or Cohen and Sackrowitz (1970); and is further discussed in some
detail by van Eeden and Zidek (2003).

Following the rotation technique used by Blumenthal and Cohen (1968a), Cohen
and Sackrowitz (1970), van Eeden and Zidek (2001, 2003) among others, we illustrate
in this section how one can exploit the information $\theta_1 - \theta_2 \in A$; for instance to
improve on the unrestricted MLE $\delta_0(X_1, X_2) = X_1$. It will be convenient to define
$C_1$ as the following subclass of estimators of $\theta_1$:


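Definition 1.

$$C_1 = \left\{ \delta_\phi : \delta_\phi(X_1, X_2) = Y_2 + \phi(Y_1), \text{ with } Y_1 = \frac{X_1 - X_2}{1 + \tau}, \; Y_2 = \frac{\tau X_1 + X_2}{1 + \tau}, \text{ and } \tau = \frac{\sigma_2^2}{\sigma_1^2} \right\}.$$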

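Note that the above defined $Y_1$ and $Y_2$ are independently normally distributed
(with $E[Y_1] = \mu_1 = \frac{\theta_1 - \theta_2}{1 + \tau}$, $E[Y_2] = \mu_2 = \frac{\tau \theta_1 + \theta_2}{1 + \tau}$,
$Cov(Y_1) = \frac{\sigma_1^2}{1 + \tau} I_p$, and $Cov(Y_2) = \frac{\tau \sigma_1^2}{1 + \tau} I_p$).
Given this independence, the risk function of $\delta_\phi$ (for $\theta = (\theta_1, \theta_2)$) becomes

$$\begin{aligned}
R(\theta, \delta_\phi(X_1, X_2)) &= E_\theta\left[ \|Y_2 + \phi(Y_1) - \theta_1\|^2 \right] \\
&= E_\theta\left[ \|Y_2 - \mu_2 + \phi(Y_1) - \mu_1\|^2 \right] \\
&= E_\theta\left[ \|Y_2 - \mu_2\|^2 \right] + E_\theta\left[ \|\phi(Y_1) - \mu_1\|^2 \right].
\end{aligned}$$

Therefore, the performance of $\delta_\phi(X_1, X_2)$ as an estimator of $\theta_1$ is measured
solely by the performance of $\phi(Y_1)$ as an estimator of $\mu_1$ under the model
$Y_1 \sim N_p(\mu_1, \frac{\sigma_1^2}{1 + \tau} I_p)$, with the restriction $\mu_1 \in C = \{y : (1 + \tau) y \in A\}$. In particular, one
gets the following dominance result.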


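Proposition 1. For estimating $\theta_1$ under squared error loss, with $\theta_1 - \theta_2 \in A$, the
estimator $\delta_{\phi_1}(X_1, X_2)$ will dominate $\delta_{\phi_0}(X_1, X_2)$ if and only if

$$E_{\mu_1}\left[ \|\phi_1(Y_1) - \mu_1\|^2 \right] \le E_{\mu_1}\left[ \|\phi_0(Y_1) - \mu_1\|^2 \right],$$

for $\mu_1 \in C$ (with strict inequality for some $\mu_1$).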
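To make the rotation concrete, here is a small Python sketch for the scalar case $p = 1$. It assumes $\tau = \sigma_2^2 / \sigma_1^2$ (the choice that makes $Y_1$ and $Y_2$ uncorrelated), builds $Y_1 = (X_1 - X_2)/(1+\tau)$ and $Y_2 = (\tau X_1 + X_2)/(1+\tau)$, and forms an estimator $Y_2 + \phi(Y_1)$ for a user-supplied $\phi$. The function names are hypothetical, and the choice $\phi(y) = \max(0, y)$ corresponds to the assumed constraint $\theta_1 - \theta_2 \ge 0$.

```python
def rotate(x1, x2, sigma1, sigma2):
    """Return (y1, y2, tau) for scalar observations x1, x2."""
    tau = sigma2 ** 2 / sigma1 ** 2
    y1 = (x1 - x2) / (1.0 + tau)
    y2 = (tau * x1 + x2) / (1.0 + tau)
    return y1, y2, tau

def cov_y1_y2(sigma1, sigma2):
    """Covariance of Y1 and Y2 computed from their linear coefficients."""
    tau = sigma2 ** 2 / sigma1 ** 2
    a1, a2 = 1.0 / (1.0 + tau), -1.0 / (1.0 + tau)   # Y1 = a1*X1 + a2*X2
    b1, b2 = tau / (1.0 + tau), 1.0 / (1.0 + tau)    # Y2 = b1*X1 + b2*X2
    return a1 * b1 * sigma1 ** 2 + a2 * b2 * sigma2 ** 2

def delta_phi(x1, x2, sigma1, sigma2, phi):
    """Estimator of theta_1 of the form Y2 + phi(Y1)."""
    y1, y2, _ = rotate(x1, x2, sigma1, sigma2)
    return y2 + phi(y1)

def identity(y):
    """phi_0(y) = y recovers the unrestricted estimator X1."""
    return y

def positive_part(y):
    """Projection of y onto [0, infinity)."""
    return max(0.0, y)

if __name__ == "__main__":
    print(delta_phi(1.0, 3.0, 1.0, 1.0, identity))       # unrestricted: X1
    print(delta_phi(1.0, 3.0, 1.0, 1.0, positive_part))  # pooled when X1 < X2
```

With equal variances and $X_1 < X_2$, the second estimator returns the average of the two observations, while for $X_1 \ge X_2$ it returns $X_1$ unchanged; this is the shrinkage towards $A + X_2$ described in the text, under the stated assumptions.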


We pursue with some applications of Proposition 1, which we accompany with 
various comments and historical references. 


(A) Case where $\delta_{\phi_0}(X_1, X_2) = X_1$ (i.e., the unrestricted MLE of $\theta_1$), and where $A$
is convex with a non empty interior.

This estimator arises as a member of $C_1$ for $\phi_0(Y_1) = Y_1$. Hartigan's result
(Theorem 1) applies to $\phi_0(Y_1)$ (since $A$ convex implies $C$ convex), and tells
us that the Bayes estimator $\phi_{U_C}(Y_1)$ of $\mu_1$ with respect to a uniform prior on
$C$ dominates $\phi_0(Y_1)$ (under squared-error loss). Hence, Proposition 1 applies
with $\phi_1 = \phi_{U_C}$, producing the following dominating estimator of $\delta_{\phi_0}(X_1, X_2)$:

$$\delta_{\phi_{U_C}}(X_1, X_2) = Y_2 + \phi_{U_C}(Y_1). \qquad (4)$$

For $p = 1$ and $A = [-m, m]$, the dominance of $\delta_{\phi_0}$ by the estimator given in
(4) was established by van Eeden and Zidek (2001), while for $p = 1$ and $A = [0, \infty)$
(or $A = (-\infty, 0]$), this dominance result was established by Kubokawa
and Saleh (1994). In both cases, Kubokawa's IERD method, as presented in
Section 3, was utilized to produce a class of dominating estimators which
includes $\delta_{\phi_{U_C}}(X_1, X_2)$. As was the case in Section 2, these previously known
dominance results yield extensions to sets $A$ which are hyperrectangles or
intersection of half-spaces, but Hartigan's result yields a much more general
result.

Remark 4. Here are some additional notes on previous results related to the case
$p = 1$ and $A = [0, \infty)$. Kubokawa and Saleh (1994) also provide various extensions
to other distributions with monotone likelihood ratio and to strict bowl-shaped
losses, while van Eeden and Zidek (2003) introduce an estimator obtained from a
weighted likelihood perspective and discuss its performance in comparison to several
others including $\delta_{\phi_{U_C}}(X_1, X_2)$. The admissibility and minimaxity of $\delta_{\phi_{U_C}}(X_1, X_2)$
(under squared error loss) were established by Cohen and Sackrowitz (1970).
Further research concerning this problem, and the related problem of estimating jointly
$\theta_1$ and $\theta_2$, has appeared in Blumenthal and Cohen (1968b), Brewster and Zidek
(1974), and Kumar and Sharma (1988) among many others. There is equally a
substantial body of work concerning estimating a parameter $\theta_1$ (e.g., location, scale,
discrete family) under various kinds of order restrictions involving $k$ parameters
$\theta_1, \ldots, \theta_k$ (other than work referred to elsewhere in this paper, see for instance van
Eeden and Zidek, 2001, 2003 for additional annotated references).

Another dominating estimator of $\delta_{\phi_0}(X_1, X_2) = X_1$, which may be seen as a
consequence of Proposition 1, is given by $\delta_{\phi_{mle}}(X_1, X_2) = Y_2 + \phi_{mle}(Y_1)$,
where $\phi_{mle}(Y_1)$ is the MLE of $\mu_1$, $\mu_1 \in C$. This is so because, as remarked upon in
Section 2, $\phi_1(Y_1) = \phi_{mle}(Y_1)$ dominates under squared error loss, as an estimator of
$\mu_1 \in C$; $\phi_0(Y_1) = Y_1$. Observe further that the maximum likelihood estimator
$\delta_{mle}(X_1, X_2)$ of $\theta_1$ for the parameter space $\Theta = \{(\theta_1, \theta_2) : \theta_1 - \theta_2 \in A, \tau\theta_1 + \theta_2 \in \Re^p\}$
is indeed given by: $\delta_{mle}(X_1, X_2) = (\hat{\mu}_2)_{mle} + (\hat{\mu}_1)_{mle} = Y_2 + \phi_{mle}(Y_1)$, given
the independence and normality of $Y_1$ and $Y_2$, and the fact that $Y_2$ is the MLE of
$\mu_2$ ($\mu_2 \in \Re^p$).

Our next two applications of Proposition 1 deal with the estimator $\delta_{mle}(X_1, X_2)$.

(B) Case where $A$ is a ball and $\delta_{\phi_0}(X_1, X_2) = \delta_{mle}(X_1, X_2)$.

For the case where $\theta_1 - \theta_2 \in A$, with $A$ being a $p$-dimensional ball of radius
$m$ centered at 0, the estimator $\delta_{mle}(X_1, X_2)$ arises as a member of $C_1$ for
$\phi_0(Y_1) = \phi_{mle}(Y_1) = (\|Y_1\| \wedge \frac{m}{1 + \tau}) \frac{Y_1}{\|Y_1\|}$. By virtue of Proposition 1, it follows
that dominating estimators $\phi_1(Y_1)$ of $\phi_{mle}(Y_1)$ (for $\|\mu_1\| \le m/(1 + \tau)$),
such as those given by Marchand and Perron (2001) (see Section
4.3 above), yield dominating estimators $\delta_{\phi_1}(X_1, X_2) = Y_2 + \phi_1(Y_1)$
of $\delta_{mle}(X_1, X_2)$.

(C) Case where $A = [0, \infty)$, and $\delta_{\phi_0}(X_1, X_2) = \delta_{mle}(X_1, X_2)$. This is similar
to (B), and dominating estimators can be constructed by using Shao and
Strawderman's (1996) dominating estimators of the MLE of a positive normal
mean (see van Eeden and Zidek, 2001, Theorem 4.3).
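As a small illustration of case (B), the Python sketch below computes the restricted MLE of $\theta_1$ when $\theta_1 - \theta_2$ lies in a ball of radius $m$: it rotates $(X_1, X_2)$ to $(Y_1, Y_2)$, projects $Y_1$ onto the ball of radius $m/(1+\tau)$, and adds back $Y_2$. It assumes $\tau = \sigma_2^2/\sigma_1^2$ and represents vectors as plain Python lists; the function names are hypothetical.

```python
import math

def project_onto_ball(y, radius):
    """Euclidean projection of the vector y onto the ball of given radius centered at 0."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= radius:
        return list(y)
    scale = radius / norm
    return [scale * v for v in y]

def delta_mle_ball(x1, x2, sigma1, sigma2, m):
    """Y2 + projection of Y1 onto the ball of radius m / (1 + tau)."""
    tau = sigma2 ** 2 / sigma1 ** 2
    y1 = [(a - b) / (1.0 + tau) for a, b in zip(x1, x2)]
    y2 = [(tau * a + b) / (1.0 + tau) for a, b in zip(x1, x2)]
    projected = project_onto_ball(y1, m / (1.0 + tau))
    return [u + v for u, v in zip(y2, projected)]

if __name__ == "__main__":
    # X1 - X2 has norm 5 > m = 1, so Y1 is pulled back to the boundary of C.
    print(delta_mle_ball([3.0, 4.0], [0.0, 0.0], 1.0, 1.0, 1.0))
    # X1 - X2 already satisfies the constraint, so the estimate is X1 itself.
    print(delta_mle_ball([0.2, 0.0], [0.0, 0.0], 1.0, 1.0, 1.0))
```

Replacing `project_onto_ball` by a componentwise positive part would give the analogous estimator for case (C) with $p = 1$, again only as a sketch under the same assumptions.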

Observe that results in (B) (for $p = 1$) and (C) lead to further applications of
Proposition 1 for sets $A$ which are hyperrectangles or intersection of half-spaces. We
conclude by pointing out that the approach used in this section may well lead to new
directions in future research. For instance, the methods used above could be used
to specify dominating estimators for the case $p \ge 3$, (of $\delta_{\phi_0}(X_1, X_2) = Y_2 + \phi_0(Y_1)$)
of the form $\phi_2(Y_2) + \phi_1(Y_1)$ where, not only is $\phi_1(Y_1)$ a dominating estimator of
$\phi_0(Y_1)$ for $\mu_1 \in C$, but for $p \ge 3$, $\phi_2(Y_2)$ is a Stein-type estimator of $\mu_2$ which
dominates $Y_2$.

6. Minimax estimation 

This section presents an overview of minimax estimation in compact parameter
spaces, with a focus on the case of an interval constraint of the type $\theta \in [a, b]$ and
analytical results giving conditions for which the minimax estimator is a Bayesian
estimator with respect to a boundary prior on $\{a, b\}$. Historical elements are first
described in Section 6.1, a somewhat novel expository example is presented in
Section 6.2, and we further describe complementary results in Section 6.3.

6.1. Two point least favourable priors 

With the criteria of minimaxity playing a vital role in the development of
statistical theory and practice; as reviewed in Brown (1994) or Strawderman (2000) for
instance; the results of Casella and Strawderman (1981), as well as those of Zinzius
(1981) most certainly inspired a lot of further work. These results presented
analytically obtained minimax estimators, under squared error loss, of a normal model
mean $\theta$, with known variance, when $\theta$ is known to be restricted to a small enough
interval. More precisely, Casella and Strawderman showed, for $X \sim N(\theta, 1)$ with
$\theta \in \Theta = [-m, m]$; (there is no loss of generality in assuming the variance to be
1, and the interval to be symmetric about 0); that the uniform boundary Bayes
estimator $\delta_{BU}(x) = m \tanh(mx)$ is unique minimax iff $m \le m_0 \approx 1.0567$.
Furthermore, they also investigated three-point priors supported on $\{-m, 0, m\}$, and
obtained sufficient conditions for such a prior to be least favourable. It is worth
mentioning that these results give immediately minimax multivariate extensions to
rectangular constraints where $X_i \sim N(\theta_i, 1)$; $i = 1, \ldots, p$; with $|\theta_i| \le m_i \le m_0$,
under losses $\sum_{i=1}^{p} \omega_i (d_i - \theta_i)^2$ (with arbitrary positive weights $\omega_i$), since the least
favourable prior for estimating $(\theta_1, \ldots, \theta_p)$ is obtained, in such a case, as the
product of the least favourable priors for estimating $\theta_1, \ldots, \theta_p$ individually. Now, the
above minimaxity results were obtained by using the following well-known
criteria for minimaxity (e.g., Berger, 1985, Section 5.3, or Lehmann and Casella, 1998,
Section 5.1).

Lemma 1. If $\delta_\pi$ is a Bayes estimator with respect to a prior distribution $\pi$, and
$S_\pi = \{\theta \in \Theta : \sup_{\theta'} \{R(\theta', \delta_\pi)\} = R(\theta, \delta_\pi)\}$, then $\delta_\pi$ is minimax whenever
$P_\pi(\theta \in S_\pi) = 1$.
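The condition of Lemma 1 can be explored numerically for the two-point boundary prior in the normal case. The sketch below assumes $X \sim N(\theta, 1)$ and the estimator $x \mapsto m \tanh(mx)$ quoted above, approximates its risk by Simpson's rule on a grid of $\theta \in [0, m]$ (the risk is even in $\theta$), and reports whether the largest risk on the grid occurs at the boundary point $\theta = m$, where the prior puts its mass. The function names, grid and quadrature settings are hypothetical.

```python
import math

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def risk_boundary_bayes(theta, m, half_width=10.0, n=4000):
    """Simpson approximation of E_theta[(m tanh(m X) - theta)^2], X ~ N(theta, 1)."""
    a = theta - half_width
    step = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        x = a + i * step
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * (m * math.tanh(m * x) - theta) ** 2 * normal_pdf(x - theta)
    return total * step / 3.0

def max_risk_at_boundary(m, grid_points=5):
    """True if, on an evenly spaced grid of [0, m], the largest risk is at theta = m."""
    thetas = [m * k / (grid_points - 1) for k in range(grid_points)]
    risks = [risk_boundary_bayes(t, m) for t in thetas]
    return max(risks) == risks[-1]

if __name__ == "__main__":
    for m in (1.0, 1.5):
        print(m, risk_boundary_bayes(0.0, m), risk_boundary_bayes(m, m),
              max_risk_at_boundary(m))
```

For $m = 1$, which is below the quoted cutoff $m_0 \approx 1.0567$, the maximum on the grid should sit at the boundary, whereas for $m = 1.5$ the risk at $\theta = 0$ should exceed the risk at $\theta = m$, so the equalizer condition of Lemma 1 fails for the two-point prior. A finite grid is of course only suggestive, not a proof.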

Casella and Strawderman's work capitalized on Karlin's (1957) sign change
arguments for implementing Lemma 1 while, in contrast, the sufficient conditions
obtained by Zinzius concerning the minimaxity of $\delta_{BU}(X)$ were established
using the "convexity technique" as stated as part (b) of the following Corollary to
Lemma 1. Part (a), introduced here as an alternative condition, will be used later
in this section.

Corollary 1. If $\delta_\pi$ is a Bayes estimator with respect to a two-point prior on $\{a, b\}$
such that $R(a, \delta_\pi) = R(b, \delta_\pi)$, then $\delta_\pi$ is minimax for the parameter space $\Theta = [a, b]$
whenever, as a function of $\theta \in [a, b]$,

(a) $\frac{\partial}{\partial \theta} R(\theta, \delta_\pi)$ has at most one sign change from $-$ to $+$; or

(b) $R(\theta, \delta_\pi)$ is convex.

Although the convexity technique applied to the bounded normal mean problem
gives only a lower bound for $m_0$; (Bader and Bischoff (2003) report that the best
known bound using convexity is $\frac{\sqrt{2}}{2}$ as given by Bischoff and Fieger (1992)); it
has proven very useful for investigating least favourable boundary supported priors
for other models and loss functions. In particular, DasGupta (1985) used
subharmonicity to establish, for small enough compact and convex parameter spaces
under squared error loss, the inevitability of a boundary supported least favourable
prior for a general class of univariate and multivariate models. As well, the work
of Bader and Bischoff (2003), Boratyńska (2001), van Eeden and Zidek (1999),
and Eichenauer-Hermann and Fieger (1992), among others, establish this same
inevitability with some generality with respect to the loss and/or the model.
Curiously, as shown by Eichenauer-Hermann and Ickstadt (1992), and Bischoff and
Fieger (1993), there need not exist a boundary least favourable prior for convex,
but not strictly convex, losses. Indeed, their results both include the important case
of a normal mean restricted to an interval and estimated with absolute-value loss,
where no two-point least favourable prior exists.

6.2. Two-point least favourable priors in symmetric location families 

We present here a new development for location family densities (with respect to
Lebesgue measure on $\Re^1$) of the form

$$f_\theta(x) = e^{-h(x - \theta)}, \text{ with convex and symmetric } h. \qquad (5)$$

These assumptions on $h$ imply that such densities $f_\theta$ are unimodal, symmetric
about $\theta$, and possess monotone increasing likelihood ratio in $X$. For estimating
$\theta$ with squared error loss under the restriction $\theta \in [-m, m]$, our objective here
is to present a simple illustration of the inevitability of a boundary supported
least favourable prior for small enough $m$, i.e., $m \le m_0(h)$. Namely, we give for
densities in (5) with concave $h'(x)$ for $x \ge 0$ (this implies convex $h'(x)$ for $x \le 0$) a
simple lower bound for $m_0(h)$. We pursue with two preliminary lemmas: the latter
one giving simple and general conditions for which a wide subclass of symmetric
estimators (i.e., equivariant under sign changes) $\delta(X)$ of $\theta$ have increasing risk
$R(\theta, \delta(X))$ in $|\theta|$ under squared error loss.

Lemma 2. If $g$ is a bounded and almost everywhere differentiable function, then
under (5):

$$\frac{d}{d\theta} E_\theta[g(X)] = E_\theta[g'(X)].$$

Proof. First, interchange derivative and integral to obtain $\frac{d}{d\theta} E_\theta[g(X)] = E_\theta[g(X) h'(X - \theta)]$.
Then, integrating by parts yields the result. □
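The identity of Lemma 2 lends itself to a quick numerical check. The Python sketch below assumes the standard normal case, $h(y) = y^2/2$ plus a normalizing constant, and the bounded function $g(x) = \tanh(x)$; it compares a central finite difference of $\theta \mapsto E_\theta[g(X)]$ with $E_\theta[g'(X)]$, both approximated by Simpson's rule. The helper names and the step sizes are illustrative assumptions.

```python
import math

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def expectation(func, theta, half_width=10.0, n=4000):
    """Simpson approximation of E_theta[func(X)] for X ~ N(theta, 1)."""
    a = theta - half_width
    step = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        x = a + i * step
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * func(x) * normal_pdf(x - theta)
    return total * step / 3.0

def g(x):
    """A bounded, differentiable test function."""
    return math.tanh(x)

def g_prime(x):
    """Derivative of tanh."""
    return 1.0 - math.tanh(x) ** 2

def lemma2_gap(theta, eps=1e-3):
    """Finite-difference derivative of E_theta[g(X)] minus E_theta[g'(X)]."""
    derivative = (expectation(g, theta + eps) - expectation(g, theta - eps)) / (2.0 * eps)
    return derivative - expectation(g_prime, theta)

if __name__ == "__main__":
    for theta in (0.0, 0.7, 2.0):
        print(theta, lemma2_gap(theta))
```

The printed gaps should be close to zero, up to finite-difference and quadrature error.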
Lemma 3. For models in (5), and estimators $\delta(X)$ with the properties: (a) $\delta(x) = -\delta(-x)$;
(b) $\delta'(x) \ge 0$; and (c) $\delta'(x)$ decreasing in $|x|$; for all $x \in \Re$; either one of
the following conditions is sufficient for $R(\theta, \delta(X))$ to be increasing in $|\theta|$; $|\theta| \le m$:

(i) $E_\theta[\delta(X)] \le \theta (1 - E_\theta[\delta'(X)])$, for $0 \le \theta \le m$;

(ii) $E_\theta[\delta(X)] \le \theta (1 - \delta'(0))$, for $0 \le \theta \le m$;

(iii) $\delta'(0) \le \frac{1}{2}$.

Proof. It will suffice to work with the condition $\frac{\partial}{\partial \theta} R(\theta, \delta(X)) \ge 0$, $0 \le \theta \le m$,
since $R(\theta, \delta(X))$ is an even function of $\theta$, given property (a) and the symmetry
of $h$. Differentiating directly the risk and using Lemma 2, we obtain

$$\frac{1}{2} \frac{\partial}{\partial \theta} R(\theta, \delta(X)) = \theta - E_\theta[\delta(X)] - \theta E_\theta[\delta'(X)] + E_\theta[\delta(X) \delta'(X)].$$

With properties (a) and (b), the function $\delta(x) \delta'(x)$ changes signs once (at $x = 0$)
from $-$ to $+$, and, thus, sign change properties under $h$ imply that $E_\theta[\delta(X) \delta'(X)]$
changes signs at most once from $-$ to $+$ as a function of $\theta$. Since $E_0[\delta(X) \delta'(X)] = 0$
by symmetry of $\delta(x) \delta'(x)$ and $h$, we must have $E_\theta[\delta(X) \delta'(X)] \ge 0$ for $\theta \ge 0$. It
then follows that

$$\frac{1}{2} \frac{\partial}{\partial \theta} R(\theta, \delta(X)) \ge \theta - E_\theta[\delta(X)] - \theta E_\theta[\delta'(X)];$$

and this yields directly sufficient condition (i). Now, property (c) tells us that
$\delta'(x) \le \delta'(0)$, and this indicates that condition (ii) implies (i), hence its sufficiency.
Finally, condition (iii) along with Lemma 2 and the properties of $\delta(X)$ implies (ii)
since $\frac{\partial}{\partial \theta} E_\theta[\delta(X) + \theta(\delta'(0) - 1)] = E_\theta[\delta'(X) + (\delta'(0) - 1)] \le E_\theta[2\delta'(0) - 1] \le 0$, and
$E_\theta[\delta(X) + \theta(\delta'(0) - 1)] \big|_{\theta = 0} = 0$. □

We pursue by applying Lemma 3 to the case of the boundary uniform Bayes
estimator $\delta_{BU}(X)$ to obtain, by virtue of Corollary 1, part (a), a minimaxity result
for $\delta_{BU}(X)$.

Corollary 2. For models in (5), $\delta_{BU}(X)$ is a unique minimax estimator of $\theta$ (under
squared error loss) for the parameter space $[-m, m]$ when either one of the following
situations arises:

(A) Condition (i) of Lemma 3 holds;

(B) $h'(x)$ is concave for $x \ge 0$, and condition (ii) of Lemma 3 holds;

(C) $h'(x)$ is concave for $x \ge 0$, and $m \le m^*(h)$ where $m^*(h)$ is the solution in $m$
of the equation $m h'(m) = \frac{1}{2}$.

Proof. We apply Corollary 1, part (a), and Lemma 3. To do so, we need to
investigate properties (b) and (c) of Lemma 3 for the estimator $\delta_{BU}(X)$ (property
(a) is necessarily satisfied since the uniform boundary prior is symmetric).
Under model (5), the Bayes estimator $\delta_{BU}(X)$ and its derivative (with respect
to $X$) may be expressed as:

$$\delta_{BU}(x) = \frac{m e^{-h(x - m)} - m e^{-h(x + m)}}{e^{-h(x - m)} + e^{-h(x + m)}} = m \tanh\left( \frac{h(x + m) - h(x - m)}{2} \right); \qquad (6)$$

and

$$\delta'_{BU}(x) = \frac{m^2 - \delta_{BU}^2(x)}{m} \cdot \frac{h'(x + m) - h'(x - m)}{2}. \qquad (7)$$

Observe that $|\delta_{BU}(x)| \le m$, and $h'(x + m) \ge h'(x - m)$ by the convexity of $h$, so
that $\delta'_{BU}(x) \ge 0$ given (7). This establishes property (b) of Lemma 3, and part (A).
Now, $m^2 - \delta_{BU}^2(x)$ is decreasing in $|x|$, and so is $h'(x + m) - h'(x - m)$ given, namely,
the concavity of $h'(x)$ for $x \ge 0$. This tells us that $\delta_{BU}(X)$ verifies property (c)
of Lemma 3, and (B) follows. Hence, condition (iii) of Lemma 3 applies becoming
equivalent to $m h'(m) \le \frac{1}{2}$, as $\delta'(0) = m h'(m)$ by (7). Finally, the result follows by
the fact that $m h'(m)$ is a continuous and increasing function of $m$, $m > 0$. □

Remark 5. As the outcome of the above argument, combining both sign change
arguments and convexity considerations, containing other elements which may be of
independent interest, part (C) of Corollary 2 gives a simple sufficient condition
for the minimaxity of $\delta_{BU}$, and is applicable to a wide class of models in (5).
Namely, for Exponential Power families where, in (5), $h(y) = \alpha |y|^\beta$ (up to an additive
normalizing constant) with $\alpha > 0$ and $1 \le \beta \le 2$, part (C) of Corollary 2 applies, and
tells us that $\delta_{BU}(X)$ (which may be derived from (6)) is unique minimax whenever
$m \le m^*(h) = (2 \alpha \beta)^{-1/\beta}$. In particular for double-exponential cases (i.e., $\beta = 1$),
we obtain $m^*(h) = \frac{1}{2\alpha}$, and for the standard normal case (i.e., $(\alpha, \beta) = (\frac{1}{2}, 2)$),
we obtain $m^*(h) = \frac{\sqrt{2}}{2}$. Observe
that the normal case $m^*(h)$ matches the one given by Bischoff and Fieger (1992);
and that, as expected with the various lower bounds used for the derivative of the
risk, it falls somewhat below Casella and Strawderman's necessary and sufficient
cutoff point of $m_0 \approx 1.05674$.
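The cutoff of part (C) of Corollary 2 can be computed for any admissible $h$ by solving one scalar equation. The Python sketch below assumes that the defining equation is $m h'(m) = 1/2$ (as suggested by condition (iii) of Lemma 3 together with $\delta'(0) = m h'(m)$), and that $h(y) = \alpha |y|^\beta$ up to an additive constant for the Exponential Power family, so that $h'(y) = \alpha \beta y^{\beta - 1}$ for $y > 0$. It solves the equation by bisection and compares the root with $(2\alpha\beta)^{-1/\beta}$; the function names are hypothetical.

```python
import math

def exp_power_h_prime(alpha, beta):
    """Derivative of the assumed h(y) = alpha * |y|**beta on y > 0."""
    def h_prime(y):
        return alpha * beta * y ** (beta - 1.0)
    return h_prime

def m_star(h_prime, lower=1e-9, upper=50.0, iterations=200):
    """Bisection for the root of m * h'(m) = 1/2, assuming m * h'(m) is increasing."""
    for _ in range(iterations):
        mid = 0.5 * (lower + upper)
        if mid * h_prime(mid) > 0.5:
            upper = mid
        else:
            lower = mid
    return 0.5 * (lower + upper)

def m_star_closed_form(alpha, beta):
    """Root of alpha * beta * m**beta = 1/2."""
    return (2.0 * alpha * beta) ** (-1.0 / beta)

if __name__ == "__main__":
    for alpha, beta in ((0.5, 2.0), (1.0, 1.0), (1.0, 1.5)):
        print(alpha, beta, m_star(exp_power_h_prime(alpha, beta)),
              m_star_closed_form(alpha, beta))
```

Under these assumptions the standard normal case $(\alpha, \beta) = (1/2, 2)$ gives a cutoff of about 0.7071, below the cutoff $m_0 \approx 1.05674$ of Casella and Strawderman quoted in the text.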

6.3. Some additional results and comments 

The problem considered in Section 6.2 was studied also by Eichenauer-Hermann
and Ickstadt (1992), who obtained similar results using a convexity argument for
the models in (5) with $L^p$, $p > 1$ loss. Additional work concerning least favourable
boundary priors for various models can be found in: Moors (1985), Berry (1989),
Eichenauer (1986), Chen and Eichenauer (1988), Eichenauer-Hermann and Fieger
(1989), Bischoff (1992), Bischoff and Fieger (1992), Berry (1993), Johnstone and
MacGibbon (1992), Bischoff, Fieger and Wulfert (1995), Bischoff, Fieger, and
Ochtrop (1995), Marchand and MacGibbon (2000), and Wan, Zou and Lee (2000).

Facilitated by results guaranteeing the existence of a least favourable prior
supported on a finite number of points (e.g., Ghosh, 1964), the dual problem of
searching numerically for a least favourable prior $\pi$, as presented in Lemma 1, is very
much the standard approach for minimax estimation problems in compact
parameter spaces. Algorithms to capitalize on this have been presented by Nelson (1965),
and Kempthorne (1987), and have been implemented by Marchand and MacGibbon
(2000) for a restricted binomial probability parameter, and by MacGibbon, Gourdin,
Jaumard, and Kempthorne (2000) for a restricted Poisson parameter, among others.
Other algorithms have been investigated by Gourdin, Jaumard, and MacGibbon
(1994).

Analytical and numerical results concerning the related criterion of Gamma-
Minimaxity in constrained parameter spaces have been given by Vidakovic
and DasGupta (1996), Vidakovic (1993), Lehn and Rummel (1987), Eichenauer
and Lehn (1989), Bischoff (1992), Bischoff and Fieger (1992), Bischoff, Fieger and
Wulfert (1995), and Wan et al. (2000).
For spherical bounds of the form $\|\theta\| \le m$, Berry (1990) generalized Casella
and Strawderman’s minimaxity of $\delta_{BU}$ result for multivariate normal models
$X \sim N_p(\theta, I_p)$. He showed with sign change arguments that $\delta_{BU}$ is unique minimax for
$m \le m_0(p)$, giving defining equations for $m_0(p)$. Recently, Marchand and Perron
(2002) showed that $m_0(p) \ge \sqrt{p}$, and that $m_0(p)/\sqrt{p} \approx 1.15096$ for large $p$. For
larger $m$, least favourable distributions are mixtures of a finite number of uniform
distributions on spheres (see Robert, 2001, page 73, and the given references),
but the number, position and mixture weights of these spheres require numerical
evaluation.

Early and significant contributions to the study of minimax estimation of a normal
mean restricted to an interval or a ball of radius $m$ were given by Bickel (1981)
and Levit (1980). These contributions consisted of approximations to the minimax
risk and least favourable prior for large $m$ under squared error loss. In particular,
Bickel showed that, as $m \to \infty$, the least favourable distributions rescaled to $[-1, 1]$
converge weakly to a distribution with density $\cos^2(\pi x/2)$, and that the minimax
risks behave like $1 - \pi^2/m^2 + o(m^{-2})$. Extensions and further interpretations of these
results were given by Melkman and Ritov (1987), Gajek and Kaluszka (1994), and
Delampady and others (2001). There is also a substantial literature on the comparative
efficiency of minimax procedures and affine linear minimax estimators for
various models, restricted parameter spaces, and loss functions. A small sample of
such work includes Pinsker (1980), Ibragimov and Hasminskii (1984), Donoho, Liu
and MacGibbon (1990), and Johnstone and MacGibbon (1992, 1993).
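As a small numerical illustration of Bickel's large-$m$ approximations quoted above, the following sketch checks that $\cos^2(\pi x/2)$ integrates to one over $[-1, 1]$ and evaluates the leading terms $1 - \pi^2/m^2$ of the minimax risk. The function names are hypothetical, the integration rule is an arbitrary choice, and the approximation is only meaningful for large $m$.

```python
import math


def bickel_density(x):
    """Limiting least favourable density cos^2(pi*x/2) on [-1, 1], zero elsewhere."""
    return math.cos(math.pi * x / 2) ** 2 if -1.0 <= x <= 1.0 else 0.0


def approx_minimax_risk(m):
    """Leading terms 1 - pi^2/m^2 of the minimax risk (the o(m^-2) remainder is dropped)."""
    return 1.0 - math.pi ** 2 / m ** 2


def midpoint_integral(f, lo, hi, n=20000):
    """Composite midpoint rule with n equal subintervals."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))


total_mass = midpoint_integral(bickel_density, -1.0, 1.0)
risk_at_10 = approx_minimax_risk(10.0)
```

The approximate risk increases towards 1, the risk of the unrestricted estimator $X$, as the bound $m$ grows.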
Abstract: We generalize a set of axioms introduced by Rubin (1987) to the
case of partial preference. That is, we consider cases in which not all uncertain
acts are comparable to each other. We demonstrate some relations between
these axioms and a decision theory based on sets of probability/utility pairs.
We illustrate by example how comparisons solely between pairs of acts are not
sufficient to distinguish between decision makers who base their choices on
distinct sets of probability/utility pairs.

1. Introduction 

Rubin (1987) presented axioms for rational choice amongst sets of available actions.
These axioms generalize those of Von Neumann and Morgenstern (1947), which deal
solely with comparisons between pairs of actions. Both of these sets of axioms imply
that all actions that are choosable from a given set are equivalent in the sense that
the rational agent would be indifferent between choosing amongst them. We weaken
the axioms of Rubin (1987) by allowing that the agent might not be able to choose
between actions without being indifferent between them.

There are several reasons for allowing noncomparability (unwillingness to choose
without being indifferent) between actions. One simple motivation is a consideration
of robustness of decisions to changes in parts of a statistical model. For example,
consider an estimation problem with a loss function but several competing models
for data and/or parameters. We might be interested in determining which estimators
can be rejected in the sense that they do not minimize the expected loss
under even a single one of the competing models. The agent may not be indifferent
between the estimators that remain without being able to select a best one.

With regard to sets of choices, Rubin (1987, p. 49) says “The basic concept is
that of a choice set. This is a set of actions that will be chosen by decision maker;
we do not assume the decision maker can select a unique action.” Nevertheless, the
axioms of Rubin (1987) lead to a unique (up to positive affine transformation) utility
that ranks all actions, just as do the axioms of Von Neumann and Morgenstern
(1947). The weakening of the axioms that we present here is consistent with a
set of utilities combined through a Pareto-style criterion, which we introduce in
Section 3.

2. Comparison of axioms 

Initially, we consider a nonempty convex collection $\mathcal{A}$ of acts. In particular, for every
$x_1, x_2 \in \mathcal{A}$ and every $0 \le a \le 1$, $a x_1 + (1 - a) x_2 \in \mathcal{A}$. As such, the set of acts must
lie in some part of a space where convex combination makes sense. Typically, we
think of acts either as probability distributions over a set $\mathcal{R}$ or as functions from
some other set $\Omega$ to probability distributions on $\mathcal{R}$. These interpretations make
convex combination a very natural operation, but the various axiom systems and
the related theorems do not rely on one particular class of interpretations.
The classic axioms of Von Neumann and Morgenstern (1947) are the following.

Von Neumann-Morgenstern Axiom 1. There exists a weak order $\preceq$ on $\mathcal{A}$.
That is,

• for every $x \in \mathcal{A}$, $x \preceq x$,

• for every $x, y \in \mathcal{A}$, either $x \preceq y$, or $y \preceq x$, or both, and

• for all $x, y, z \in \mathcal{A}$, if $x \preceq y$ and $y \preceq z$, then $x \preceq z$.

In the case in which $x \preceq y$ and $y \preceq x$, we say $x \sim y$.

Von Neumann-Morgenstern Axiom 2. For all $x, y, z \in \mathcal{A}$, $x \preceq y$ if and only
if for all $0 < a \le 1$, $ax + (1 - a)z \preceq ay + (1 - a)z$.

Von Neumann-Morgenstern Axiom 2 is the most controversial of the classic
axioms. Its appeal stems from the following scenario. Imagine that a coin
(independent of everything else in the problem) is flipped with probability $a$ of landing
heads. If the coin lands heads, you must choose between $x$ and $y$; otherwise, you
get $z$. Presumably the choice you would make between $x$ and $y$ would be the same
in this setting as it would be if you merely had to choose between $x$ and $y$ without
any coin flip. The controversy arises out of the following scenario. The coin flip that
determines which of $x$ or $z$ arises from $ax + (1 - a)z$ can be different (although with
the same probability) from the coin flip that determines which of $y$ or $z$ arises from
$ay + (1 - a)z$. From a minimax standpoint, the first scenario can lead to a different
choice between $ax + (1 - a)z$ and $ay + (1 - a)z$ than does the second scenario.

Von Neumann-Morgenstern Axiom 3. For all $x, y, z \in \mathcal{A}$, if $x \prec y \prec z$, then
there exists $0 < a < 1$ such that $y \sim ax + (1 - a)z$.

Von Neumann-Morgenstern Axiom 3 prevents any acts from being worth
infinitely more (or infinitesimally less) than other acts. Under these axioms, Von
Neumann and Morgenstern (1947) prove that there exists a utility $U : \mathcal{A} \to \mathbb{R}$
satisfying

• for all $x, y \in \mathcal{A}$, $x \preceq y$ if and only if $U(x) \le U(y)$,

• for all $x, y \in \mathcal{A}$ and $0 < a < 1$, $U(ax + (1 - a)y) = aU(x) + (1 - a)U(y)$, and

• $U$ is unique up to positive affine transformation.
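The second and third properties of the utility can be illustrated on finite lotteries. The sketch below is only an example under stated assumptions: the prizes, the utility values and the helper names `mix` and `expected_utility` are hypothetical, and acts are taken to be probability distributions over a finite prize set, one of the interpretations mentioned earlier.

```python
from fractions import Fraction


def mix(a, x, y):
    """Convex combination a*x + (1-a)*y of two lotteries (dicts prize -> probability)."""
    prizes = set(x) | set(y)
    return {r: a * x.get(r, 0) + (1 - a) * y.get(r, 0) for r in prizes}


def expected_utility(u, x):
    """Expected utility of lottery x under the prize utilities u."""
    return sum(u[r] * prob for r, prob in x.items())


# Hypothetical prize utilities and lotteries, chosen only for illustration.
u = {"low": Fraction(0), "mid": Fraction(3), "high": Fraction(10)}
x = {"low": Fraction(1, 2), "high": Fraction(1, 2)}
y = {"mid": Fraction(1)}
a = Fraction(1, 4)

lhs = expected_utility(u, mix(a, x, y))
rhs = a * expected_utility(u, x) + (1 - a) * expected_utility(u, y)

# A positive affine transformation of u ranks the lotteries in the same way.
u2 = {r: 2 * v + 1 for r, v in u.items()}
same_ranking = (expected_utility(u, x) > expected_utility(u, y)) == (
    expected_utility(u2, x) > expected_utility(u2, y)
)
```

Exact rational arithmetic is used so that the mixture identity holds with equality rather than up to rounding.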

The axioms of Rubin (1987), which we state next, make use of the convex hull
of a set $E \subseteq \mathcal{A}$, which is denoted $H(E)$. Rubin (1987) was particularly concerned
with the idea that, when presented with a set $E$ of actions, the agent might insist
on randomizing between actions in $E$ rather than selecting an action from $E$ itself.
This is why the choice set from $E$ is a subset of $H(E)$.

Rubin Axiom 1. There is a function $C : 2^{\mathcal{A}} \to 2^{\mathcal{A}}$ that satisfies

• for all $E \in 2^{\mathcal{A}}$, $C(E) \subseteq H(E)$, and if $E$ has 1, 2, or 3 elements then $C(E) \ne \emptyset$.

The function $C$ in Rubin Axiom 1 can be thought of as a generalization of the
weak order $\preceq$ from Von Neumann-Morgenstern Axiom 1: $x \preceq y$ if and only if
$y \in C(\{x, y\})$.

Rubin Axiom 2. For all $T, S \in 2^{\mathcal{A}}$, if $T \subseteq H(S)$ and $H(T) \cap C(S) \ne \emptyset$, then
$C(T) = H(T) \cap C(S)$.
Rubin Axiom 2 says that if an act is choosable from a large set, then it remains
choosable from any smaller set that contains it.

If $S \subseteq \mathcal{A}$, $x \in \mathcal{A}$, and $0 < a < 1$, define $aS + (1 - a)x = \{ay + (1 - a)x : y \in S\}$.

Rubin Axiom 3. For all $S \subseteq \mathcal{A}$ and all $0 < a < 1$, if $C(S) \ne \emptyset$, then
$C(aS + (1 - a)x) = aC(S) + (1 - a)x$.

Rubin Axiom 3 is the obvious analog to Von Neumann-Morgenstern Axiom 2.

Rubin Axiom 4. Let $S \subseteq \mathcal{A}$ and $x \in H(S)$. If, for all $V \subseteq H(S)$, ($x \in V$ and
$C(V) \ne \emptyset$) implies $x \in C(V)$, then $x \in C(S)$.

Rubin Axiom 4 says that, if an act is not choosable from $S$, then it is not
choosable from some subset of $S$.

Rubin Axiom 5. Let $x, y, z \in \mathcal{A}$ be such that $C(\{x, y\}) = \{x\}$ and $C(\{y, z\}) = \{y\}$.
Then there exists $0 < a < 1$ such that $\{y, ax + (1 - a)z\} \subseteq C(\{y, ax + (1 - a)z\})$.

Rubin Axiom 5 is an obvious analog to Von Neumann-Morgenstern Axiom 3.
Under these axioms, Rubin (1987) proves that there exists a utility $U : \mathcal{A} \to \mathbb{R}$
satisfying

• for all $E \subseteq \mathcal{A}$, $C(E) = \{x \in H(E) : \text{for all } y \in E,\ U(x) \ge U(y)\}$,

• for all $x, y \in \mathcal{A}$ and $0 < a < 1$, $U(ax + (1 - a)y) = aU(x) + (1 - a)U(y)$, and

• $U$ is unique up to positive affine transformation.

It is fairly simple to show that, if such a $U$ exists, then all of Rubin’s axioms hold.
Hence, his result is that his axioms characterize choice sets that are related to utility
functions in the way described by the three bullets above.

In order to allow noncomparability, we need more general axioms than Von
Neumann-Morgenstern Axiom 3 and Rubin Axiom 5. To state the more general
axioms, we need a topology on the set of actions. For now, assume that the set of
acts $\mathcal{A}$ is a metric space with some metric $d$. When we consider specific examples,
we will construct the metric. Let $\mathcal{F}$ be the collection of nonempty closed subsets
of $\mathcal{A}$.

We prefer to state our axioms in terms of a rejection function rather than a 
choice set function. 

Definition 1. A rejection function $R$ is a function $R : \mathcal{F} \to 2^{\mathcal{A}}$ such that, for all $E \in \mathcal{F}$,
$R(E) \subseteq E$, and $R(E) \ne E$.

Axiom 1. If $B \subseteq R(A)$ and if $A \subseteq D$, then $B \subseteq R(D)$.

Axiom 1 is the same as Sen’s property $\alpha$ (see Sen (1977)). It says that adding
more options to a set of acts doesn't make the rejected ones become acceptable.

Axiom 2. If $B$ is a subset of $R(A)$ and if $D$ is a subset of $B$, then $B \setminus D \subseteq R(A \setminus D)$.

Axiom 2 says that rejected acts remain rejected even if we remove other rejected
acts from the option set.

Definition 2. For $A, B \in \mathcal{F}$, say that $A \prec B$ if $A \subseteq R(A \cup B)$.

Lemma 1. Assume Axiom 1 and Axiom 2. Then $\prec$ is a strict partial order on $\mathcal{F}$.
Proof. Let $A, B \in \mathcal{F}$. If $A \prec B$, then $B \not\prec A$ because $A$ and $B$ being closed implies
that $R(A \cup B) \ne A \cup B$. For transitivity, assume that $A \prec B$ and $B \prec D$ with
$A, B, D \in \mathcal{F}$. Then $A \subseteq R(A \cup B) \subseteq R(A \cup B \cup D)$, by Axiom 1. Also Axiom 1 says
that $B \subseteq R(B \cup D) \subseteq R(A \cup B \cup D)$. It follows that $A \cup B \subseteq R(A \cup B \cup D)$. Let
$E = B \setminus A$. Then

$$A = (A \cup B) \setminus E \subseteq R((A \cup B \cup D) \setminus E) \subseteq R(A \cup D),$$

where the first inclusion is from Axiom 2 and the second is from Axiom 1. □

Our next axiom is similar to Rubin Axiom 3. 

Axiom 3. For all $E \in \mathcal{F}$, all $x \in \mathcal{A}$, and all $0 < a \le 1$, $B = R(E)$ if and only if
$aB + (1 - a)x = R(aE + (1 - a)x)$.

For the continuity axiom, we require the concept of a sequence of sets that are 
all indexed the same way. 

Definition 3. Let $G$ be an index set with cardinality less than that of $\mathcal{A}$. Let $H$
be another index set. Let $\mathcal{H} = \{E_h : h \in H\}$ be a collection of subsets of $\mathcal{A}$. We say
that the sets in $\mathcal{H}$ are indexed in common by $G$ if, for each $h \in H$ and each $g \in G$,
there exists $x_{h,g} \in \mathcal{A}$ such that $E_h = \{x_{h,g} : g \in G\}$.

Axiom 4. Let $G_A$ and $G_B$ be index sets with cardinalities less than that of $\mathcal{A}$. Let
$\{A_n\}_{n=1}^{\infty}$ be a sequence of elements of $\mathcal{F}$ such that each $A_n = \{x_{n,g} : g \in G_A\}$. Also,
let $\{B_n\}_{n=1}^{\infty}$ be a sequence of elements of $\mathcal{F}$ such that each $B_n = \{x_{n,g} : g \in G_B\}$.
Suppose that for each $g \in G_A \cup G_B$, $x_{n,g} \to x_g \in \mathcal{A}$. Let $A = \{x_g : g \in G_A\}$ and
$B = \{x_g : g \in G_B\}$. Let $N$ and $J$ be closed subsets of $\mathcal{A}$.

• If $B_n \prec A_n$ for all $n$ and $A \prec N$, then $B \prec N$.

• If $B_n \prec A_n$ for all $n$ and $J \prec B$, then $J \prec A$.

The reason for wording Axiom 4 with the additional sets $N$ and $J$ is that the
acts in $B$ and $A$ might be noncomparable when compared to each other because the
limit process brings $B_n$ and $A_n$ so close together. But the axiom says that the limit
of a sequence of rejected options can't jump over something that is better than the
limit of choosable options. Similarly, the limit of a sequence of choosable options
can’t jump below something that is worse than the limit of rejected options.

We state one additional axiom here that is necessary for the generalization that
we hope to achieve. Recall that $H(E)$ is the convex hull of the set $E$.

Axiom 5. For each $E \in \mathcal{F}$ and $B \subseteq E$, if $B \subseteq R(\overline{H(E)})$, then $B \subseteq R(E)$.

Axiom 5 says that if acts are rejected when the closed convex hull $\overline{H(E)}$ of $E$ is
available then they must also be rejected when $E$ alone is available. Closing the
convex hull of a closed set of acts should not allow us to reject any acts that we
couldn't reject before.

3. Pareto Criteria 

After proving the existence of the utility, Rubin (1987) considers cases with many
utilities indexed by elements of some set $\Omega$. He then says (p. 53) “Two immediate
examples come to mind: $\Omega$ may be the class of states of nature, or $\Omega$ may be the
set of all individuals in a population. Suppose we assume that the choice process
given $\omega$ is ‘reasonable’ for each $\omega \in \Omega$, and the overall process is also reasonable.”
The first of the two examples envisioned by Rubin (1987) is the usual case in which
there is uncertainty about unknown events. The second example is the case in
which the “overall process” is governed by a social welfare function. Our approach
is motivated by an alternative way of thinking about individuals in a population.
Instead of a social welfare function that performs just like an individual’s utility,
we seek a characterization of the agreements amongst the individuals.

Definition 4. Let $\aleph$ be a set. For each $\alpha \in \aleph$, let $R_\alpha : \mathcal{F} \to 2^{\mathcal{A}}$ be a rejection
function. The Pareto rejection function related to $\{R_\alpha : \alpha \in \aleph\}$ is $R(E) = \bigcap_{\alpha \in \aleph} R_\alpha(E)$
for all $E \in \mathcal{F}$.

In this definition, an act $x$ is Pareto rejected by the group $\aleph$ if it is rejected by
every member of the group. The complement of the Pareto rejection function might
be called the Pareto choice function $C : \mathcal{F} \to 2^{\mathcal{A}}$ defined by $C(E) = [R(E)]^C$. This
is the set of acts that fail to be rejected by at least one individual in $\aleph$.

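The general example that motivates our work is the following. Let $\Omega$ be a finite
set of states. For each $\alpha \in \aleph$ let $P_\alpha$ be a probability on $\Omega$. Let acts be functions from
$\Omega$ to probability measures over some finite set of prizes $\mathcal{R}$. That is, let $\mathcal{P}_{\mathcal{R}}$ be the
set of probability measures over $\mathcal{R}$ so that each act $x \in \mathcal{A}$ is a function $x : \Omega \to \mathcal{P}_{\mathcal{R}}$
and $x(\omega)(r)$ is the probability of prize $r$ in state $\omega$. For each $\alpha \in \aleph$, let there be a
bounded possibly state-dependent utility $U_\alpha(\cdot \mid \omega)$, appropriately measurable. Define
$V_\alpha : \mathcal{A} \to \mathbb{R}$ by

$$V_\alpha(x) = \sum_{\omega \in \Omega} \left[ \sum_{r \in \mathcal{R}} U_\alpha(r \mid \omega)\, x(\omega)(r) \right] P_\alpha(\omega).$$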

Next, define

$$C_\alpha(E) = \{x \in E : V_\alpha(x) \ge V_\alpha(y), \text{ for all } y \in E\},$$

and $R_\alpha(E) = [C_\alpha(E)]^C$. Hence, $C_\alpha(E)$ is the set of all Bayes rules in the model with
utility $U_\alpha$ and probability $P_\alpha$. Then $R(E) = \bigcap_{\alpha \in \aleph} R_\alpha(E)$ is the set of all acts $x$
such that, for every model in $\aleph$, $x$ fails to be a Bayes rule. We call this rejection
function the Bayes rejection function related to $\{(P_\alpha, U_\alpha) : \alpha \in \aleph\}$. Finally, we
define the metric on $\mathcal{A}$. All acts are equivalent to points in a bounded subset of a
finite-dimensional Euclidean space. If $s$ is the number of states and $t$ is the number
of prizes, then act $x$ is equivalent to an $s \times t$ matrix with $(i, j)$ entry equal to the
probability of prize $j$ in state $i$. We will use the usual Euclidean metric as $d$. It is
now easy to see that $V_\alpha$ is a continuous function of $x$ for each $\alpha$.
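For a finite option set and finitely many models, the Bayes rejection function can be computed directly. The following is a minimal sketch under simplifying assumptions that are not part of the text: all models share one utility, so each act is represented by its vector of state-wise expected utilities, and the index set is a finite list of probability vectors. The function names and the sample acts are hypothetical.

```python
from fractions import Fraction as F


def expected_utility(x, p):
    """Expected utility of act x (vector of state-wise expected utilities) under p."""
    return sum(xi * pi for xi, pi in zip(x, p))


def bayes_rules(acts, p):
    """Acts in the option set that maximize expected utility under p."""
    best = max(expected_utility(x, p) for x in acts)
    return {x for x in acts if expected_utility(x, p) == best}


def bayes_rejection(acts, probs):
    """Acts that fail to be a Bayes rule under every probability vector in probs."""
    acts = set(acts)
    chosen = set()
    for p in probs:
        chosen |= bayes_rules(acts, p)
    return acts - chosen


# Hypothetical two-state example.
E = {(F(1), F(0)), (F(0), F(1)), (F(2, 5), F(2, 5)), (F(3, 10), F(3, 10))}
probs = [(F(1), F(0)), (F(0), F(1)), (F(1, 2), F(1, 2))]
rejected = bayes_rejection(E, probs)

# Enlarging the option set does not rescue rejected acts (the content of Axiom 1).
D = E | {(F(1, 2), F(1, 2))}
still_rejected = rejected <= bayes_rejection(D, probs)
```

In this example the two constant acts are rejected because, under each of the three probability vectors, some other act does strictly better.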

Lemma 2. If $B \prec A$, then, for each $\alpha \in \aleph$, there is $y \in A \setminus B$ such that
$V_\alpha(y) > \sup_{z \in B} V_\alpha(z)$.

Proof. We can think of $B$ as a closed and bounded subset of a finite-dimensional
Euclidean space. For each $\alpha \in \aleph$, $V_\alpha$ is continuous, hence there exists $x \in B$
such that $V_\alpha(x) = \sup_{z \in B} V_\alpha(z)$. Since $x \in B$, there exists $y \in A \cup B$ such that
$V_\alpha(x) < V_\alpha(y)$. By the definition of $x$ it is clear that $y \in A \setminus B$. □

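Lemma 3. The Bayes rejection function related to $\{(P_\alpha, U_\alpha) : \alpha \in \aleph\}$ satisfies
Axiom 1.

Proof. Let $A \in \mathcal{F}$ and $B \subseteq R(A)$ and $A \subseteq D$. If $x \in B$, then for each $\alpha \in \aleph$,
there is $y_\alpha \in A$ such that $V_\alpha(x) < V_\alpha(y_\alpha)$. Since $y_\alpha \in D$ for all $\alpha$, it follows that
$x \in R(D)$. □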
Lemma 4. The Bayes rejection function related to $\{(P_\alpha, U_\alpha) : \alpha \in \aleph\}$ satisfies
Axiom 2.

Proof. Let $B$ be a closed subset of $R(A)$ and let $D \subseteq B$. Let $x \in B \setminus D$. Since
$x \in B$, for every $\alpha \in \aleph$, there exists $y_\alpha \in A \setminus B$ such that $V_\alpha(x) < V_\alpha(y_\alpha)$ by
Lemma 2. Since $y_\alpha \in A \setminus D$ as well, we have $x \in R(A \setminus D)$. □

Lemma 5. The Bayes rejection function related to $\{(P_\alpha, U_\alpha) : \alpha \in \aleph\}$ satisfies
Axiom 3.

Proof. The “if” direction is trivial because $a = 1$ is included. For the “only if”
direction, let $0 < a \le 1$, $x \in \mathcal{A}$ and $E \in \mathcal{F}$. First, we show that $R(aE + (1 - a)x) \subseteq
aR(E) + (1 - a)x$. Let $z \in R(aE + (1 - a)x)$. Express $z = ay + (1 - a)x$, with $y \in E$.
For every $\alpha \in \aleph$ there is $z_\alpha = ay_\alpha + (1 - a)x$ with $y_\alpha \in E$ and $V_\alpha(z_\alpha) > V_\alpha(z)$.
This implies $V_\alpha(y_\alpha) > V_\alpha(y)$ and $y \in R(E)$, so $z \in aR(E) + (1 - a)x$. Finally, let
$z \in aR(E) + (1 - a)x$, and express $z = ay + (1 - a)x$, with $y \in R(E)$. For every
$\alpha \in \aleph$, there is $y_\alpha \in E$ such that $V_\alpha(y_\alpha) > V_\alpha(y)$ so that $V_\alpha(ay_\alpha + (1 - a)x) > V_\alpha(z)$.
It follows that $z \in R(aE + (1 - a)x)$. □

Lemma 6. The Bayes rejection function related to $\{(P_\alpha, U_\alpha) : \alpha \in \aleph\}$ satisfies
Axiom 4.

Proof. Assume $B_n \prec A_n$ for all $n$. Let $g \in G_B$ and $\alpha \in \aleph$. For each $n$, there is
$h_{n,g} \in G_A$ such that $V_\alpha(x_{n,g}) < V_\alpha(x_{n,h_{n,g}})$. By continuity of $V_\alpha$, we have

$$V_\alpha(x_g) \le \liminf_{n} V_\alpha(x_{n,h_{n,g}}) \le \sup_{h \in G_A} V_\alpha(x_h) \le \sup_{x \in A} V_\alpha(x).$$

Because $V_\alpha$ is continuous and $A$ is a closed and bounded subset of a finite-dimensional
Euclidean space, there exists $y \in A$ such that $V_\alpha(y) = \sup_{x \in A} V_\alpha(x)$. It follows that

$$\sup_{g \in G_B} V_\alpha(x_g) \le V_\alpha(y). \qquad (1)$$

For the first line of Axiom 4, assume that $A \prec N$. For each $g \in G_B$ and each
$\alpha \in \aleph$, we need to find $z \in B \cup N$ such that $V_\alpha(x_g) < V_\alpha(z)$. Let $y$ be as in (1).
Because $A \prec N$, there is $z \in N \setminus A \subseteq B \cup N$ such that $V_\alpha(z) > V_\alpha(y) \ge V_\alpha(x_g)$.

For the second line of Axiom 4, assume that $J \prec B$. For each $x \in J$ and each
$\alpha \in \aleph$, we need to find $y \in J \cup A$ such that $V_\alpha(x) < V_\alpha(y)$. Let $x \in J$ and $\alpha \in \aleph$.
By Lemma 2 there is $x_g \in B \setminus J$ such that $V_\alpha(x) < V_\alpha(x_g)$. Let $y$ be as in (1). Since
$y \in A \subseteq J \cup A$, we are done. □

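Lemma 7. The Bayes rejection function related to $\{(P_\alpha, U_\alpha) : \alpha \in \aleph\}$ satisfies
Axiom 5.

Proof. Let $E \in \mathcal{F}$ and $B \subseteq E$. Assume that $B \subseteq R(\overline{H(E)})$. Let $x \in B$. For each
$\alpha \in \aleph$, we know that there exists $z_\alpha \in \overline{H(E)}$ such that $V_\alpha(x) < V_\alpha(z_\alpha)$. This $z_\alpha$
is a limit of elements of $H(E)$ and $V_\alpha$ is continuous, hence there is a $w_\alpha \in H(E)$
such that $V_\alpha(x) < V_\alpha(w_\alpha)$. This $w_\alpha$ is a convex combination of elements of $E$,
$w_\alpha = \sum_{i=1}^{\ell} a_i w_{i,\alpha}$ with $w_{i,\alpha} \in E$ and $\sum_{i=1}^{\ell} a_i = 1$ with all $a_i \ge 0$. Since

$$V_\alpha(w_\alpha) = \sum_{i=1}^{\ell} a_i V_\alpha(w_{i,\alpha}) > V_\alpha(x),$$

there must exist $i$ such that $V_\alpha(w_{i,\alpha}) > V_\alpha(x)$. Let $y_\alpha = w_{i,\alpha}$. Then $y_\alpha \in E$ and
$V_\alpha(x) < V_\alpha(y_\alpha)$ for each $\alpha \in \aleph$, so $x \in R(E)$. □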
What the preceding results establish is that the Bayes rejection function related
to a collection of probability/utility pairs satisfies our axioms. We would like to
consider the opposite implication, that is, whether or not every rejection function
that satisfies our axioms is the Bayes rejection function related to some collection of
probability/utility pairs. This consideration will be postponed until another paper.

4. Pairwise choice is not enough 

Seidenfeld, Schervish and Kadane (1995) consider the first four axioms that we have
introduced in this paper but restricted to the collection of subsets of the form $\{x, y\}$
with $x, y \in \mathcal{A}$. That is, Seidenfeld, Schervish and Kadane (1995) consider choices
between pairs of acts only. They go on to prove that, under these axioms, there
exists a collection of bounded utilities $\{V_\alpha : \alpha \in \aleph\}$ that agree with all pairwise
choices in the following sense: $\{x\} \prec \{y\}$ if and only if $V_\alpha(x) < V_\alpha(y)$ for all $\alpha \in \aleph$.
The following example illustrates why Axiom 5 is necessary in the case of choice
between more than two acts at a time.

Example 1. Let $\mathcal{A} = \{(a, b) : 0 \le a, b \le 1\}$. Define the rejection function $R$ as
follows. For $(a, b) \in E$, $(a, b) \in R(E)$ if and only if there exists $(c, d) \in E$ such that,
for every $0 < p < 1$, $ap + b(1 - p) < cp + d(1 - p)$. It is not difficult to show that this
rejection function satisfies our first four axioms. However, there is no set of utility
functions for which this rejection function is the Pareto rejection function. Suppose
that $U$ were an element of such a set of utility functions. By Axiom 3, $U(a, b)$ would
have to equal $aU(1, 0) + bU(0, 1)$, hence

$$U(0.4, 0.4) = 0.4[U(1, 0) + U(0, 1)] < \max\{U(1, 0), U(0, 1)\}.$$

Hence either $U(1, 0) > U(0.4, 0.4)$ or $U(0, 1) > U(0.4, 0.4)$. Now, let $E = \{(0.4, 0.4),
(1, 0), (0, 1)\}$, and notice that $R(E) = \emptyset$. But every utility function $U$ would reject
$(0.4, 0.4)$ amongst the actions in $E$.
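The computations in Example 1 can be checked mechanically. The sketch below reads the rejection condition with $p$ ranging over the open interval $(0, 1)$; since the difference $(c - a)p + (d - b)(1 - p)$ is linear in $p$, it is positive throughout that interval exactly when both $c - a$ and $d - b$ are nonnegative and not both zero. The function names are hypothetical, and the linear utilities tried are restricted to positive values of $U(1, 0)$ and $U(0, 1)$, which is an assumption of this illustration.

```python
from fractions import Fraction as F


def dominated(x, y):
    """True if a*p + b*(1-p) < c*p + d*(1-p) for every 0 < p < 1, with x=(a,b), y=(c,d)."""
    (a, b), (c, d) = x, y
    # Linear in p: positive on the open interval iff both endpoint values are
    # nonnegative and at least one of them is strictly positive.
    return c - a >= 0 and d - b >= 0 and (c - a, d - b) != (0, 0)


def maximality_rejection(acts):
    """Acts rejected by the pairwise-domination rule of Example 1."""
    return {x for x in acts if any(dominated(x, y) for y in acts)}


def rejected_by_linear_utility(acts, u10, u01):
    """Acts that do not maximize U(a, b) = a*u10 + b*u01 over the option set."""
    def value(x):
        return x[0] * u10 + x[1] * u01
    best = max(value(x) for x in acts)
    return {x for x in acts if value(x) < best}


middle = (F(2, 5), F(2, 5))
E = {middle, (F(1), F(0)), (F(0), F(1))}
nothing_rejected = maximality_rejection(E) == set()
always_rejected = all(
    middle in rejected_by_linear_utility(E, F(u10), F(u01))
    for u10 in range(1, 6)
    for u01 in range(1, 6)
)
```

The first check reproduces $R(E) = \emptyset$, while the second shows that each of the linear utilities tried rejects $(0.4, 0.4)$, so the intersection of their rejection sets cannot be empty.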

The rejection function in Example 1 is an example of “Maximality” that was
introduced by Walley (1990). The distinction between pairwise choice and larger
choice sets goes beyond the situation of Example 1. Schervish, Seidenfeld, Kadane
and Levi (2003) look more carefully at the special case of Bayes rejection functions
in which all $U_\alpha$ are the same function $U$ and $\{P_\alpha : \alpha \in \aleph\} = \mathcal{P}$ is a convex set of
probabilities on $\Omega$. We call this the case of a cooperative team. In this case, they give
an example that illustrates how different sets $\mathcal{P}$ lead to the same collections of pairwise
choices that satisfy the axioms of Seidenfeld, Schervish and Kadane (1995).
Hence, pairwise choices are not sufficient for characterizing the corresponding set
of probability/utility pairs even in the cases in which such sets of probability/utility
pairs are known to exist.

Example 2. Let $\Omega = \{\omega_1, \omega_2, \omega_3\}$ consist of three states. Let

$$\mathcal{P}_1 = \{(p_1, p_2, p_3) : p_2 < 2p_1 \text{ for } p_1 \le 0.2\} \cup \{(p_1, p_2, p_3) : p_2 \le 2p_1 \text{ for } 0.2 < p_1 \le 1/3\},$$

$$\mathcal{P}_2 = \{(p_1, p_2, p_3) : p_2 < 2p_1 \text{ for } p_1 < 0.2\} \cup \{(p_1, p_2, p_3) : p_2 \le 2p_1 \text{ for } 0.2 \le p_1 \le 1/3\}.$$

The only difference between the two sets is that $(0.2, 0.4, 0.4) \in \mathcal{P}_2 \setminus \mathcal{P}_1$. Let $R_{\mathcal{P}_1}$
and $R_{\mathcal{P}_2}$ be the Bayes rejection functions corresponding to the two sets of
probability/utility pairs $\{(p, U) : p \in \mathcal{P}_1\}$ and $\{(p, U) : p \in \mathcal{P}_2\}$. Each act $x$ can
be represented by the vector whose $i$th coordinate is $x_i = \sum_{r \in \mathcal{R}} U(r \mid \omega_i)\, x(\omega_i)(r)$
for $i = 1, 2, 3$. In this way, the expected utility for each probability vector $p$ is
$V_p(x) = x^\top p$. Consider two arbitrary acts $x$ and $y$. We have $\{x\} = R_{\mathcal{P}_j}(\{x, y\})$ if
and only if

$$\sum_{i=1}^{3} (y_i - x_i) p_i > 0, \quad \text{for all } p \in \mathcal{P}_j.$$

This is equivalent to $(y_1 - x_1, y_2 - x_2, y_3 - x_3)$ being a hyperplane that separates
$\{0\}$ from $\mathcal{P}_j$ without intersecting $\mathcal{P}_j$. It is easy to check that a hyperplane separates
$\mathcal{P}_1$ from $\{0\}$ without intersecting $\mathcal{P}_1$ if and only if it separates $\mathcal{P}_2$ from $\{0\}$ without
intersecting $\mathcal{P}_2$. The reason is that all of the points in the symmetric difference
$\mathcal{P}_1 \Delta \mathcal{P}_2$ are extreme but not exposed. Hence, all pairwise comparisons derived from
$R_{\mathcal{P}_1}$ are identical to those derived from $R_{\mathcal{P}_2}$.

Consider now a set of acts $E$ that contains only the following three acts (each
expressed as a vector of its expected payoffs in the three states as were $x$ and $y$
above):

$$f_1 = (0.2, 0.2, 0.2), \quad f_2 = (1, 0, 0), \quad g = (-1.8, 1.2, 0.2).$$

First, let $p \in \mathcal{P}_1$. Notice that $V_p(f_2)$ is the highest of the three whenever $p_1 > 0.2$,
$V_p(f_1)$ is the highest whenever $p_1 \le 0.2$, and $V_p(g)$ is never the highest. So,
$R_{\mathcal{P}_1}(E) = \{g\}$. Next, notice that if $p = (0.2, 0.4, 0.4)$, then $V_p(g) = V_p(f_1) =
V_p(f_2) = 0.2$, so $R_{\mathcal{P}_2}(E) = \emptyset$. ◊
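The arithmetic of Example 2 can be verified with exact fractions. This is a sketch under assumptions: the membership test `in_P1` encodes the reading of the first set in which the boundary $p_2 = 2p_1$ is excluded for $p_1 \le 0.2$ and included for $0.2 < p_1 \le 1/3$, so that $(0.2, 0.4, 0.4)$ lies outside it; the set is infinite, so only a finite grid sample of it is examined, and all function names are hypothetical.

```python
from fractions import Fraction as F

f1 = (F(1, 5), F(1, 5), F(1, 5))
f2 = (F(1), F(0), F(0))
g = (F(-9, 5), F(6, 5), F(1, 5))
E = [f1, f2, g]


def value(x, p):
    """Expected utility x^T p of act x under the probability vector p."""
    return sum(xi * pi for xi, pi in zip(x, p))


def bayes_rejection(acts, probs):
    """Acts that are not expected-utility maximizers under any p in probs."""
    chosen = set()
    for p in probs:
        best = max(value(x, p) for x in acts)
        chosen |= {x for x in acts if value(x, p) == best}
    return set(acts) - chosen


def in_P1(p):
    """Assumed reading of the first set: strict inequality when p1 <= 0.2."""
    p1, p2, _ = p
    if p1 <= F(1, 5):
        return p2 < 2 * p1
    return p1 <= F(1, 3) and p2 <= 2 * p1


# Probability vectors with denominator 30; only a finite sample of the first set.
grid = [(F(i, 30), F(j, 30), F(30 - i - j, 30))
        for i in range(31) for j in range(31 - i)]
sample_P1 = [p for p in grid if in_P1(p)]
p0 = (F(1, 5), F(2, 5), F(2, 5))

rejected_sample = bayes_rejection(E, sample_P1)
rejected_with_p0 = bayes_rejection(E, sample_P1 + [p0])
```

On the sample, only $g$ is rejected; once the single extra point $(0.2, 0.4, 0.4)$ is admitted, all three acts tie at $0.2$ there and nothing is rejected, which is the distinction the example draws between the two sets.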

Next, we present a theorem which states that the more general framework of
rejection functions operating on sets of size larger than 2 can distinguish between
different sets $\mathcal{P}$ in the cooperative team case.

For the general case, let $U$ be a single, possibly state-dependent, utility function.
For each probability vector $p$ on $\Omega$ and each act $x$, let

$$V_p(x) = \sum_{\omega \in \Omega} \left[ \sum_{r \in \mathcal{R}} U(r \mid \omega)\, x(\omega)(r) \right] p(\omega).$$

Because the inner sum $w_x(\omega) = \sum_{r \in \mathcal{R}} U(r \mid \omega)\, x(\omega)(r)$ does not depend on $p$, we can
represent each act $x$ by the vector

$$(w_x(\omega_1), \ldots, w_x(\omega_s)), \qquad (2)$$

where $s$ is the number of states. That is, each act might as well be the vector in
(2) giving for each state the state-dependent expected utility with respect to its
probability distribution over prizes in that state. If we call the vector in (2) by the
name $x$, this makes $V_p(x) = x^\top p$ for every act $x$ and every probability vector $p$.

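For each convex set $\mathcal{P}$ of probability vectors there is a Bayes rejection function
defined by

$$R_{\mathcal{P}}(E) = \bigcap_{p \in \mathcal{P}} \{x \in E : V_p(x) \ge V_p(y), \text{ for all } y \in E\}^C, \qquad (3)$$

for all closed sets $E$ of acts. Example 2 shows that there are cases in which $\mathcal{P}_1 \ne
\mathcal{P}_2$ but $R_{\mathcal{P}_1}(E) = R_{\mathcal{P}_2}(E)$ for every $E$ that contains exactly two distinct acts.
Theorem 1 below states that, so long as $\mathcal{P}_1 \ne \mathcal{P}_2$, there exists a finite set $E$ of acts
such that $R_{\mathcal{P}_1}(E) \ne R_{\mathcal{P}_2}(E)$.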
Theorem 1. Let $\mathcal{P}_1$ and $\mathcal{P}_2$ be distinct convex sets of probabilities over a set $\Omega$
with $s \ge 2$ states. Then there is a set $E$ with at most $s + 1$ acts such that
$R_{\mathcal{P}_1}(E) \ne R_{\mathcal{P}_2}(E)$.

The proof of Theorem 1 is given in the appendix. 

5. Summary 

In this paper, we consider a generalization of Subjective Expected Utility theory
in which not all options are comparable by a binary preference relation. We adapt
Rubin’s (1987) axioms for rational choice functions to permit a decision maker who
has a determinate cardinal utility $U$ for outcomes to have a choice function over
simple horse-lottery options that does not coincide with a weak ordering of the
option space. In calling the decision maker’s choice function “rational”, we mean
that there is a cardinal utility $U$ and a set $\mathcal{P}$ of coherent probabilities that represent
the choice function in the following sense: The allowed choices from an option set are
exactly those Bayes-admissible options, i.e. those options that maximize expected
utility for some probability $P \in \mathcal{P}$.

In Sections 2 and 3 we give axioms that are necessary for a choice function
to be rational in this sense. We show that the axioms that we used in Schervish,
Seidenfeld, Kadane and Levi (1995) for a theory of coherent strict partial orders are
insufficient for this purpose. Specifically, those axioms are for a strict partial order
$\prec$ which is given solely by pairwise comparisons. That theory represents the strict
partial order $\prec$ by a set of probability/utility pairs according to a Pareto condition,
where each probability/utility pair agrees with the strict partial order according
to expected utility inequalities. Here we show that the choice function that Walley
calls “Maximality” obeys those axioms, but fails to have the desired representation
in terms of Bayes-admissible options when the option sets (which may fail to be
convex) involve three or more options. Therefore, we add a new Axiom 5 that
is necessary for a choice function to be rational, and which is not satisfied by
Maximality.

In Section 4 we show that, even when a rational choice function is represented
by a convex set of coherent probabilities, and when the option set also is convex,
nonetheless the choice function cannot always be reduced to pairwise comparisons.
We show how to distinguish the choice functions based on any two different convex
sets of probabilities using choice problems that go beyond pairwise comparisons.

In continuing work, we seek a set of axioms that characterize all rational choice 
functions. The axioms that we offer in Section 2 are currently a candidate for that 
theory. 

A. Proof of Theorem 1 

First, we present a few lemmas about convex sets that will be useful for the proof. 

The following result gives us a way of reexpressing a half-space of a hyperplane 
as the intersection of the hyperplane with a half-space of a more convenient form. 
The main point is that the same constant c that defines the original hyperplane H 
can also be used to define the new half-space. 

Lemma 8. Let $H = \{x : \beta^\top x = c\}$ for some vector $\beta$ and some scalar $c \ne 0$.
Let $\alpha$ be such that $\beta^\top \alpha = 0$ and let $d$ be a scalar. Then, there is a vector $\gamma$ such
that

$$\{x \in H : \alpha^\top x \ge d\} = \{x \in H : \gamma^\top x \ge c\}.$$
Proof. It is easy to check that the following vector does the job:

$$\gamma = \begin{cases} c\alpha/d & \text{if } cd > 0, \\ \alpha + \beta & \text{if } d = 0, \\ -c\alpha/d + 2\beta & \text{if } cd < 0. \end{cases}$$ □
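The case formula can be checked on sample points of the hyperplane. The sketch below is an illustration under assumptions: the middle case is taken to be $\gamma = \alpha + \beta$, the vectors $\beta = (1, 1, 1)$ and $\alpha = (1, -1, 0)$ and the integer grid of test points are arbitrary choices, and the function names are hypothetical.

```python
from fractions import Fraction as F


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def gamma_vector(alpha, beta, c, d):
    """Vector gamma given by the three-case formula (c must be nonzero)."""
    if d == 0:
        return tuple(a + b for a, b in zip(alpha, beta))
    if c * d > 0:
        return tuple(c * a / d for a in alpha)
    return tuple(-c * a / d + 2 * b for a, b in zip(alpha, beta))


def half_spaces_agree(alpha, beta, c, d, points):
    """True if alpha^T x >= d and gamma^T x >= c hold for the same given points of H."""
    gamma = gamma_vector(alpha, beta, c, d)
    return all((dot(alpha, x) >= d) == (dot(gamma, x) >= c) for x in points)


beta = (F(1), F(1), F(1))
alpha = (F(1), F(-1), F(0))  # orthogonal to beta


def points_on_H(c, radius=4):
    """Grid points x = (i, j, c - i - j), which satisfy beta^T x = c."""
    return [(F(i), F(j), c - i - j)
            for i in range(-radius, radius + 1)
            for j in range(-radius, radius + 1)]


all_cases_agree = all(
    half_spaces_agree(alpha, beta, c, d, points_on_H(c))
    for c in (F(2), F(-3))
    for d in (F(-2), F(0), F(1), F(5, 2))
)
```

The grid covers both signs of $c$ and of $d$, so each of the three cases of the formula is exercised at points on both sides of the boundary $\alpha^\top x = d$.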

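Definition 5. We say that two convex sets $\mathcal{P}_1$ and $\mathcal{P}_2$ intersect all of the same
supporting hyperplanes if

• they have the same closure, and

• for every supporting hyperplane $H$, $H \cap \mathcal{P}_1 \ne \emptyset$ if and only if $H \cap \mathcal{P}_2 \ne \emptyset$.

Definition 6. Let $\mathcal{P}_1$ and $\mathcal{P}_2$ be convex sets in $\mathbb{R}^s$. For $i = 1, 2$, define $R_{\mathcal{P}_i}$
as in (3). Let $E$ be a subset of $\mathbb{R}^s$. We say that $E$ distinguishes $\mathcal{P}_1$ and $\mathcal{P}_2$ if
$R_{\mathcal{P}_1}(E) \ne R_{\mathcal{P}_2}(E)$.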

We break the proof of Theorem 1 into two parts according to whether or not $\mathcal{P}_1$
and $\mathcal{P}_2$ intersect all of the same supporting hyperplanes. The first part deals with
cases in which a single pair of acts can distinguish two convex sets.

Lemma 9. Suppose that two convex sets $\mathcal{P}_1$ and $\mathcal{P}_2$ do not intersect all of the
same supporting hyperplanes. Then there is a set $E$ with one constant act and one
possibly nonconstant act that distinguishes $\mathcal{P}_1$ and $\mathcal{P}_2$.

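Proof. First, consider the case in which $\mathcal{P}_1$ and $\mathcal{P}_2$ don’t have the same closure.
Without loss of generality, let $p_0 \in \mathcal{P}_2 \setminus \overline{\mathcal{P}_1}$. Let $x \in \mathbb{R}^s$ and $c$ be such that
$x^\top p > c$ for all $p \in \mathcal{P}_1$ and $x^\top p_0 < c$. Let $E$ consist of the two acts $x$ and the
constant $\mathbf{c} = (c, \ldots, c)$. Clearly, $\{\mathbf{c}\} = R_{\mathcal{P}_1}(E)$ while $\mathbf{c} \notin R_{\mathcal{P}_2}(E)$.

Next, consider the case in which $\mathcal{P}_1$ and $\mathcal{P}_2$ have the same closure. Without loss
of generality, let $\{p : x^\top p = c\}$ be a supporting hyperplane that intersects $\mathcal{P}_2$ but
not $\mathcal{P}_1$ so that $x^\top p > c$ for all $p \in \mathcal{P}_1$. Let $E = \{\mathbf{c}, x\}$. Then $\{\mathbf{c}\} = R_{\mathcal{P}_1}(E)$ while
$\mathbf{c} \notin R_{\mathcal{P}_2}(E)$. □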

The following result handles the case in which pairwise choice is not sufficient to 
distinguish two sets. The proof can be summarized as follows. Start with two distinct 
convex sets of probabilities that intersect all of the same supporting hyperplanes. 
Find a supporting hyperplane that they intersect in different ways and use this as 
the first gamble in the set E in such a way that all probabilities in the hyperplane 
give the gamble the same expected value (say c) and the rest of both convex sets 
give the gamble smaller expected value. Put the constant c into E as well. Now, 
the only probabilities that keep the first gamble out of the rejection set are in the 
hyperplane. We now add further gambles to E in a sequence such that the next 
one has expected value greater than c except on a boundary of one less dimension 
than the previous one. By so doing, we reduce the set of probabilities that keep the 
first gamble out of the rejection set by decreasing its dimension by one each time. 
Eventually, we get the set of such probabilities to a zero-dimensional set (a single 
point) that lies in one of the two original convex sets but not the other. 

Lemma 10. Let $\mathcal{P}_1$ and $\mathcal{P}_2$ be distinct convex sets of probabilities in $\mathbb{R}^s$ ($s \ge 2$)
that intersect all of the same supporting hyperplanes. Then there is a set $E$ with at
most $s + 1$ gambles that distinguishes $\mathcal{P}_1$ and $\mathcal{P}_2$.
Proof. Clearly the difference between $\mathcal{P}_1$ and $\mathcal{P}_2$ is all on the common boundary.
Hence, there is some supporting hyperplane that intersects both sets but in different
ways. Let such a hyperplane be $H_1 = \{p : x_1^T p = c\}$ such that for all $p \in \mathcal{P}_i$, $x_1^T p \le c$
for $i = 1, 2$. Let $\mathcal{P}_{i,1} = \mathcal{P}_i \cap H_1$ for $i = 1, 2$. Let the first two gambles in $E$ be $x_1$
and $c$. (If $c = 0$, add a constant to every coordinate of $x_1$ and replace $c$ by that
constant.) The remainder of the proof proceeds through at most $s - 1$ additional
steps of the type to follow where one new gamble gets added to $E$ at each step.
Initialize $j = 1$.

By construction, $\mathcal{P}_{1,j}$ and $\mathcal{P}_{2,j}$ are distinct convex sets that lie in an $(s - j)$-dimensional
hyperplane. If these sets intersect all of the same supporting hyperplanes
(case 1), then find a supporting subhyperplane $H'_{j+1}$ of $H_j$ that intersects
$\mathcal{P}_{1,j}$ and $\mathcal{P}_{2,j}$ in different ways. If the sets $\mathcal{P}_{1,j}$ and $\mathcal{P}_{2,j}$ don't intersect all of the
same supporting hyperplanes (case 2), use Lemma 9 to find a subhyperplane $H'_{j+1}$
of $H_j$ that distinguishes them. In either case, use Lemma 8 to extend $H'_{j+1}$ to
$H_{j+1} = \{p : x_{j+1}^T p = c\}$ such that $x_{j+1}^T p \ge c$ for all $p \in \mathcal{P}_{i,j}$ for both $i = 1, 2$. Include
$x_{j+1}$ in $E$. Define $\mathcal{P}_{i,j+1} = \mathcal{P}_{i,j} \cap H_{j+1}$ for $i = 1, 2$.

If case 2 holds in the previous paragraph, skip to the next paragraph. If case 1
holds in the previous paragraph, then increment $j$ to $j + 1$ and repeat the construction
in the previous paragraph. Continue in this way either until case 2 holds or we
arrive at $j = s - 1$ with one-dimensional sets $\mathcal{P}_{1,s-1}$ and $\mathcal{P}_{2,s-1}$, which then must
be bounded line segments. They differ by at least one of them containing a point
that the other does not contain. Without loss of generality, suppose that $\mathcal{P}_{2,s-1}$
contains a point $p_0$ that is not in $\mathcal{P}_{1,s-1}$. Create one last vector $x_s$ so that $x_s^T p_0 = c$
and $x_s^T p > c$ for all $p \in \mathcal{P}_{1,s-1}$.

Every gamble $x \in E$ satisfies $x^T p_0 = c$, while for every $p \in \mathcal{P}_1$, there is $k \ge 2$
such that $x_k^T p > c$. It now follows that $x_1 \in R_{\mathcal{P}_1}(E)$ but $x_1 \notin R_{\mathcal{P}_2}(E)$. □
On the distribution of the greatest
common divisor

Persi Diaconis* and Paul Erdős

Stanford University

Abstract: For two integers chosen independently at random from $\{1, 2, \ldots, x\}$,
we give expansions for the distribution and the moments of their greatest common
divisor and the least common multiple, with explicit error rates. The expansion
involves Riemann's zeta function. Application to a statistical question
is briefly discussed.

1. Introduction and statement of main results

Let $M$ and $N$ be random integers chosen uniformly and independently from
$\{1, 2, \ldots, x\}$. Throughout, $(M, N)$ will denote the greatest common divisor and
$[M, N]$ the least common multiple. Cesàro (1885) studied the moments of $(M, N)$
and $[M, N]$. Theorems 1 and 2 extend his work by providing explicit error terms.
The distribution of $(M, N)$ and $[M, N]$ is given by:

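Theorem 1.

$$P_x\{[M,N] \le tx^2 \text{ and } (M,N) = k\} = \frac{6}{\pi^2}\frac{1}{k^2}\{kt(1 - \log kt)\} + O_{k,t}\left(\frac{\log x}{x}\right), \quad k \le 1/t, \tag{1.1}$$

$$P_x\{(M,N) = k\} = \frac{6}{\pi^2}\frac{1}{k^2} + O\left(\frac{\log(x/k)}{xk}\right), \tag{1.2}$$

$$P_x\{[M,N] \le tx^2\} = 1 + \frac{6}{\pi^2}\sum_{j=1}^{[1/t]}\frac{1}{j^2}\{jt(1 - \log jt) - 1\} + O_t\left(\frac{\log x}{x}\right), \tag{1.3}$$

where $[x]$ denotes the greatest integer less than or equal to $x$. Christopher (1956)
gave a weaker form of (1.2).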
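
As a rough numerical illustration of the leading term $6/(\pi^2 k^2)$ for the distribution of the greatest common divisor, the following sketch tabulates the exact distribution by brute force for a moderate $x$. The function name and the choice $x = 300$ are illustrative assumptions, not part of the paper.

```python
from math import gcd, pi


def gcd_distribution(x: int, kmax: int) -> list:
    """Exact P_x{(M, N) = k} for k = 0..kmax, with M, N uniform on {1, ..., x}."""
    counts = [0] * (kmax + 1)
    for m in range(1, x + 1):
        for n in range(1, x + 1):
            g = gcd(m, n)
            if g <= kmax:
                counts[g] += 1
    return [cnt / x**2 for cnt in counts]


x = 300
probs = gcd_distribution(x, 3)

# Leading term 6 / (pi^2 k^2); index 0 is a placeholder so indices match k.
leading = [0.0] + [6 / (pi**2 * k**2) for k in range(1, 4)]

for k in range(1, 4):
    print(k, round(probs[k], 4), round(leading[k], 4))
```

The printed pairs should agree to about two decimal places at this value of $x$; the discrepancy is the error term.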

(1.2) easily yields an estimate for the expected value of $(M, N)$:

$$E_x\{(M,N)\} = \frac{1}{x^2}\sum_{i,j \le x}(i,j) = \sum_{k \le x} k\,P_x\{(M,N) = k\} = \frac{6}{\pi^2}\log x + O(1).$$

(1.2) does not lead to an estimate for higher moments of $(M, N)$. Similarly the form
of (1.3) makes direct computation of moments of $[M, N]$ unwieldy. Using elementary
arguments we will show:

Theorem 2.

$$E_x\{(M,N)\} = \frac{6}{\pi^2}\log x + C + O\left(\frac{\log x}{\sqrt{x}}\right) \tag{1.4}$$

* Department of Statistics, Stanford University, Stanford, CA 94305-4065, USA. e-mail:
diaconis@math.stanford.edu
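where $C$ is an explicitly calculated constant;

for $k \ge 2$, $$E_x\{(M,N)^k\} = \frac{x^{k-1}}{k+1}\left\{\frac{2\zeta(k)}{\zeta(k+1)} - 1\right\} + O(x^{k-2}\log x), \tag{1.5}$$

where $\zeta(z)$ is Riemann's zeta function;

for $k \ge 1$, $$E_x\{[M,N]^k\} = \frac{\zeta(k+2)}{\zeta(2)}\frac{x^{2k}}{(k+1)^2} + O(x^{2k-1}\log x). \tag{1.6}$$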

Section two of this paper contains proofs while section three contains remarks, 
further references and an application to the statistical problem of reconstructing 
the sample size given a table of rounded percentages. 

2. Proofs of main theorems 
Throughout we use the elementary estimate

$$\Phi(x) = \sum_{1 \le k \le x}\varphi(k) = \frac{3}{\pi^2}x^2 + R(x) \tag{2.1}$$

where $R(x) = O(x\log x)$.

See, for example, Hardy and Wright (1960), Theorem 330. Since $\#\{m, n \le x : (m, n) = 1\} = 2\Phi(x) + O(1)$ and $(m, n) = k$ if and only if $k|m$, $k|n$ and
$(m/k, n/k) = 1$, we see that $\#\{m, n \le x : (m, n) = k\} = 2\Phi(x/k) + O(1)$. This
proves (1.2). To prove (1.1) and (1.3) we need a preparatory lemma.
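
The counting identity used above can be checked exactly for small $x$. In the sketch below the $O(1)$ term is taken to be exactly $-1$ (the pair $(1,1)$ would otherwise be counted twice), which is an elementary observation added here rather than a statement from the paper; the function names are hypothetical.

```python
from math import gcd


def totient_summatory(limit: int) -> list:
    """Return Phi[0..limit] with Phi[y] = sum of Euler's phi(k) for 1 <= k <= y (sieve)."""
    phi = list(range(limit + 1))
    for p in range(2, limit + 1):
        if phi[p] == p:  # untouched so far, hence p is prime
            for multiple in range(p, limit + 1, p):
                phi[multiple] -= phi[multiple] // p
    Phi = [0] * (limit + 1)
    for k in range(1, limit + 1):
        Phi[k] = Phi[k - 1] + phi[k]
    return Phi


def count_pairs_with_gcd(x: int, k: int) -> int:
    """Brute-force count of ordered pairs (m, n), 1 <= m, n <= x, with gcd(m, n) = k."""
    return sum(1 for m in range(1, x + 1) for n in range(1, x + 1) if gcd(m, n) == k)


x = 60
Phi = totient_summatory(x)

# Each entry pairs the brute-force count with 2 * Phi(floor(x / k)) - 1.
checks = [(count_pairs_with_gcd(x, k), 2 * Phi[x // k] - 1) for k in range(1, 6)]
```

Each pair in `checks` should consist of two equal integers.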

Lemma 1. If $F_x(t) = \#\{m, n \le x : mn \le tx^2 \text{ and } (m, n) = 1\}$, then

$$F_x(t) = \frac{6}{\pi^2}\,t(1 - \log t)\,x^2 + O_t(x\log x).$$

Proof. Consider the number of lattice points in the region $R_x(t) = \{m, n \le x : mn \le tx^2\}$.
It is easy to see that there are $t(1 - \log t)x^2 + O_t(x) = N_x(t)$ such points.
Also, the pair $(m, n) \in R_x(t)$ and $(m, n) = k$ if and only if $(m/k, n/k) \in R_{x/k}(t)$ and
$(m/k, n/k) = 1$. Thus $N_x(t) = \sum_{1 \le d \le x} F_{x/d}(t)$. The standard inversion formula says

$$F_x(t) = \sum_{1 \le d \le x}\mu(d)\,N_{x/d}(t) = \frac{6}{\pi^2}\,t(1 - \log t)\,x^2 + O_t(x\log x).$$


Lemma 1 immediately implies that the product of 2 random integers is asymptotically independent
of their greatest common divisor:

Corollary 1.

$$P_x\{MN \le tx^2 \mid (M, N) = k\} = t(1 - \log t) + O_{t,k}\left(\frac{\log x}{x}\right).$$
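
A brute-force check of this asymptotic independence can be sketched as follows: for a fixed $t$, the conditional probability that $MN \le tx^2$ given the greatest common divisor should be close to $t(1 - \log t)$ whatever the conditioning value $k$. The values of $x$, $t$ and $k$ below are arbitrary illustrative choices.

```python
from math import gcd, log


def conditional_product_prob(x: int, t: float, k: int) -> float:
    """Exact P_x{MN <= t x^2 | (M, N) = k} by enumeration over multiples of k."""
    hits = 0
    total = 0
    for m in range(k, x + 1, k):
        for n in range(k, x + 1, k):
            if gcd(m, n) == k:
                total += 1
                if m * n <= t * x * x:
                    hits += 1
    return hits / total


x, t = 400, 0.5
limit = t * (1 - log(t))  # the limiting value t(1 - log t)

estimates = {k: conditional_product_prob(x, t, k) for k in (1, 2, 3)}
for k, value in estimates.items():
    print(k, round(value, 4), round(limit, 4))
```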



To prove (1.1) note that, since $[M, N] = MN/(M, N)$,

$$P_x\{[M,N] \le tx^2 \text{ and } (M,N) = k\} = P_x\{[M,N] \le tx^2 \mid (M,N) = k\}\cdot P_x\{(M,N) = k\} = P_x\{MN \le ktx^2 \mid (M,N) = k\}\cdot P_x\{(M,N) = k\}.$$

Use of (1.2) and Corollary 1 completes the proof of (1.1). To prove (1.3) note that

$$P_x\{[M,N] \le tx^2\} = P_x\{(M,N) > [1/t]\} + \sum_{k=1}^{[1/t]} P_x\{[M,N] \le tx^2 \mid (M,N) = k\}\cdot P_x\{(M,N) = k\}.$$

Using (1.2) and Corollary 1 as before completes the proof of Theorem 1.
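To prove Theorem 2, write, for $k \ge 1$,

$$\sum_{m,n \le x}(m,n)^k = 2\sum_{1 \le m \le x}\ \sum_{1 \le n \le m}(m,n)^k - \sum_{1 \le m \le x} m^k = 2\sum_{1 \le m \le x} f_k(m) - \sum_{1 \le m \le x} m^k \tag{2.2}$$

where $f_k(m) = \sum_{d|m} d^k\varphi(m/d)$. Dirichlet's hyperbola argument (see, e.g., Saffari
(1970)) yields for any $t$,

$$\sum_{1 \le m \le x} f_k(m) = \sum_{1 \le i \le t} i^k\,\Phi\left(\frac{x}{i}\right) + \sum_{1 \le i \le x/t}\varphi(i)\,I_k\left(\frac{x}{i}\right) - I_k(t)\,\Phi\left(\frac{x}{t}\right) \tag{2.3}$$

where

$$I_k(y) = \sum_{1 \le i \le y} i^k = \frac{y^{k+1}}{k+1} + O(y^k).$$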

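When $k = 1$, we proceed as follows: Choose $t = \sqrt{x}$. The first sum on the right
side of (2.3) is

$$\sum_{1 \le k \le \sqrt{x}} k\,\Phi\left(\frac{x}{k}\right) = \frac{3}{\pi^2}x^2\left\{\log\sqrt{x} + \gamma + O\left(\frac{1}{\sqrt{x}}\right)\right\} + O(x^{3/2}\log x). \tag{2.4}$$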


The second sum in (2.3) is 




I<k<y/S 


l<k<y/x 


Now 


y- ^ 2A-+1 I *(\/^ 

k^ ^^(lt(lt+l))2 [x] 

l<k<^ l<*<Vx ' ' ' 




6 k 

^ ^^(k+ 1)2 

1 <lc< 


3 , 

= -jlogr + d + 


R{k) , ^ *(k) 

2_ tri. + 1 12 




°(^) 
where


= E {'*’W + + 1))' + ^ (^ + 5) (2.6) 

and $\gamma$ is Euler's constant. Using this in equation (2.5) yields that the second sum
in (2.3) is


— 1 
27T^ 


The third term in (2.3) is 


1 3 

2 7 t 2^ 


logr). 

(2.7) 

logx). 

(2.8) 


Combining (2.8), (2.7) and (2.4) in (2.3) and using this in (2.2) yields: 

^ (m,7i) = -^x^logx+ (d+ -(^^7 + + 0(i-*/^logx), 

m.n<z ^ \ ^ \ / / 

where $d$ is defined in (2.6).

When $k \ge 2$, the best choice of $t$ in (2.3) is $t = 1$. A calculation very similar to
the case of $k = 1$ leads to (1.5).

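We now prove (1.6). Consider the sum

$$\sum_{i,j \le x}[i,j]^k = 2\sum_{i \le x}\ \sum_{j \le i}[i,j]^k + O(x^{k+1}) = 2\sum_{i \le x}\ \sum_{d|i}\ \sum_{\substack{j \le i \\ (i,j) = d}}\left(\frac{ij}{d}\right)^k + O(x^{k+1}) = 2\sum_{d=1}^{[x]} d^k\sum_{j \le x/d} j^k f_k(j) + O(x^{k+1}), \tag{2.9}$$

where

$$f_k(n) = \sum_{\substack{j \le n \\ (j,n) = 1}} j^k.$$

We may derive another expression for $f_k(n)$ by considering the sum

$$\sum_{i=1}^{n} i^k = \frac{n^{k+1}}{k+1} + R_k(n) = n^k\sum_{d|n}\frac{f_k(d)}{d^k}. \tag{2.10}$$

Dividing (2.10) by $n^k$ and inverting yields

$$\frac{f_k(n)}{n^k} = \frac{1}{k+1}\sum_{d|n}\mu(d)\frac{n}{d} + \sum_{d|n}\mu(d)\,R_k\left(\frac{n}{d}\right)\left(\frac{d}{n}\right)^k.$$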
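When we substitute this expression for $f_k(j)$ in (2.9) we must evaluate:

$$S_1(y) = \sum_{j \le y} j^{2k}\sum_{d|j}\mu(d)\,R_k\left(\frac{j}{d}\right)\left(\frac{d}{j}\right)^k = \sum_{i \le y}\mu(i)\,i^{2k}\sum_{d \le y/i} R_k(d)\,d^k.$$

Now $R_k(d)$ is a polynomial in $d$ of degree $k$. Thus,

$$S_1(y) = O(y^{2k+1}\log y).$$

We must also evaluate

$$S_2(y) = \frac{1}{k+1}\sum_{j \le y} j^{2k}\sum_{d|j}\mu(d)\frac{j}{d} = \frac{1}{k+1}\sum_{j \le y} j^{2k}\varphi(j) = \frac{6}{\pi^2}\frac{y^{2k+2}}{(k+1)(2k+2)} + O(y^{2k+1}\log y) = \frac{3}{\pi^2}\frac{y^{2k+2}}{(k+1)^2} + O(y^{2k+1}\log y).$$

Substituting in the right side of (2.9) we have

$$\sum_{i,j \le x}[i,j]^k = 2\sum_{d=1}^{[x]} d^k\left\{S_1\left(\frac{x}{d}\right) + S_2\left(\frac{x}{d}\right)\right\} + O(x^{k+1}) = \frac{6}{\pi^2}\frac{x^{2k+2}}{(k+1)^2}\sum_{d=1}^{[x]}\frac{1}{d^{k+2}} + O(x^{2k+1}\log x) = \frac{\zeta(k+2)}{\zeta(2)}\frac{x^{2k+2}}{(k+1)^2} + O(x^{2k+1}\log x).$$

□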


3. Miscellaneous remarks 


1. If $M_1, M_2, \ldots, M_k$ are random integers chosen uniformly at random then the
results stated in Christopher (1956) (see also Cohen (1960), Herzog and Stewart
(1971), and Nymann (1972)) imply that


c.{(AA, A'/2 "‘) = ^ ^ ^ 

We have not tried to extend theorems 1 and 2 to the k-dimensional case. 

(3.1) has an application to a problem in applied statistics. Suppose a population
of $n$ individuals is distributed into $k$ categories with $n_i$ individuals in category $i$.
Often only the proportions $p_i = n_i/n$ are reported. A method for estimating $n$
given $p_i$, $1 \le i \le k$, is described in Wallis and Roberts (1956), pp. 184-189. Briefly,
let $m = \min\left|\sum_{i=1}^{k} p_i b_i\right|$ where the minimum is taken over all $k$-tuples $(b_1, b_2, \ldots, b_k)$,
with $b_i \in \{0, \pm 1, \pm 2, \ldots\}$ not all $b_i$ equal zero. An estimate for $n$ is $[1/m]$. This
method works if the $p_i$ are reported with enough precision and the $n_i$ are relatively
prime for then the Euclidean algorithm implies there are integers $\{b_i\}$ such that
$\sum_{i=1}^{k} b_i n_i = 1$. These $b_i$ give the minimum $m = 1/n$. If it is reasonable to approximate
the $n_i$ as random integers then (3.1) implies that $\mathrm{Prob}((n_1, n_2, \ldots, n_k) = 1) = 1/\zeta(k)$
and, as expected, as $k$ increases this probability goes to 1. For example, $1/\zeta(5) = .964$,
$1/\zeta(7) = .992$, $1/\zeta(9) = .998$. This suggests the method has a good chance of working
with a small number of categories. Wallis and Roberts (1956) give several examples
and further details about practical implementation.
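
The estimation method described above can be sketched in a few lines under simplifying assumptions: the proportions are exact fractions rather than rounded percentages, the search over the $b_i$ is restricted to a small box, and the minimum is taken over nonzero values of $|\sum p_i b_i|$ (a literal minimum over all nonzero tuples could be zero). The function names, the search bound and the example counts are hypothetical.

```python
from fractions import Fraction
from itertools import product


def estimate_sample_size(proportions, bound: int = 3) -> int:
    """Return [1/m], where m is the smallest nonzero |sum b_i p_i| over |b_i| <= bound."""
    best = None
    for b in product(range(-bound, bound + 1), repeat=len(proportions)):
        value = abs(sum(bi * pi for bi, pi in zip(b, proportions)))
        if value > 0 and (best is None or value < best):
            best = value
    return int(1 / best)


def inverse_zeta(k: int, terms: int = 10000) -> float:
    """Approximate 1/zeta(k) by truncating the defining series (adequate for k >= 5)."""
    return 1 / sum(j ** -k for j in range(1, terms + 1))


# Hypothetical category counts with gcd 1, so the method should recover n exactly.
counts = [3, 5, 7]
n = sum(counts)
proportions = [Fraction(c, n) for c in counts]
estimate = estimate_sample_size(proportions)

coprime_probabilities = {k: round(inverse_zeta(k), 3) for k in (5, 7, 9)}
```

With rounded proportions, the exact comparison `value > 0` would have to be replaced by a tolerance reflecting the reporting precision; that refinement is not attempted here.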

2. The best result we know for $R(x)$ defined in (2.1) is due to Saltykov (1960).
He shows that

$$R(x) = O\left(x(\log x)^{2/3}(\log\log x)^{1+\epsilon}\right).$$

Use of this throughout leads to a slight improvement in the bounds of Theorems 1
and 2.

3. The functions $(M, N)$ and $[M, N]$ are both multiplicative in the sense of
Delange (1969, 1970). It would be of interest to derive results similar to Theorems 1
and 2 for more general multiplicative functions.
Versions of de Finetti's Theorem with
applications to damage models*

C. R. Rao* and D. N. Shanbhag^' 

The Pennsylvania State University 

Abstract: Alzaid et al. (1986) and Rao et al. (2002) have shown that several
of the results on damage models have links with certain results on nonnegative
matrices. Rao et al. (2002) have also shown that there is a connection between
a specialized version of de Finetti's theorem for discrete exchangeable random
variables and a potential theoretic result relative to nonnegative matrices.

In the present article, we deal with integral equations met in damage model
studies via specialized versions of de Finetti's theorem and extend further the
theorems of Rao and Rubin (1964) and Shanbhag (1977) on damage models.


1. Introduction 

The concept of damage models was first introduced by Rao (1963) and it has led
to many interesting and illuminating characterizations of discrete distributions;
among various noteworthy results in the area are those of Rao and Rubin (1964)
and Shanbhag (1977). In mathematical terms, a damage model can be described by
a random vector $(X, Y)$ of non-negative integer-valued components, with the joint
probability law of $X$ and $Y$ having the following structure:

$$P\{X = x, Y = y\} = S(y|x)\,g_x, \quad y = 0, 1, 2, \ldots, x;\ x = 0, 1, 2, \ldots, \tag{1.1}$$

where $\{S(y|x) = P\{Y = y \mid X = x\} : y = 0, 1, 2, \ldots, x\}$ is a discrete probability
law for each $x = 0, 1, 2, \ldots$ and $\{g_x = P\{X = x\} : x = 0, 1, 2, \ldots\}$ is the marginal
probability law of $X$. In the context of damage models, the conditional probability
law $\{S(y|x) : y = 0, 1, 2, \ldots, x\}$ is called the survival distribution. It is also natural
to call $Y$ the undamaged part of $X$ and $X - Y$ the damaged part of $X$. Multivariate
versions of the terminologies have also been dealt with in the literature. Rao and
Rubin (1964) showed via Bernstein's theorem for absolutely monotonic functions
that if the survival distribution is binomial with parameter vector $(x, p)$ for almost
all $x$ (i.e. for each $x$ with $g_x > 0$), where $p \in (0, 1)$ and fixed, and $g_0 < 1$, then the
Rao-Rubin condition (RR(0))

$$P\{Y = y\} = P\{Y = y \mid X = Y\}, \quad y = 0, 1, 2, \ldots \tag{1.2}$$

is met if and only if $X$ is Poisson. It was pointed out by Shanbhag (1977) that an
extended version of the Rao-Rubin result can be deduced from the solution to a
*One of us has collaborated with Professor Herman Rubin on a result which is now known
in statistical literature as the Rao-Rubin theorem. This theorem and another result known as
Shanbhag's theorem have generated considerable research on characterization problems. Our paper
on these theorems and some further results is dedicated to Professor Rubin in appreciation of his
fundamental contributions to statistical inference.

Address for correspondence: 3 Worcester Close, Sheffield S10 4JF, England, United Kingdom.

Center for Multivariate Analysis, Thomas Building, The Pennsylvania State University,
University Park, PA 16802, USA. e-mail: d.shanbhag@btopenworld.com

Keywords and phrases: de Finetti's theorem, Choquet-Deny theorem, Lau-Rao-Shanbhag
theorems, Rao-Rubin-Shanbhag theorems, Rao's damage model, Rao-Rubin condition.
general recurrence relation of the form

$$v_n = \sum_{m=0}^{\infty} w_m v_{m+n}, \quad n = 0, 1, 2, \ldots, \tag{1.3}$$

where $\{w_m : m \ge 0\}$ is a given sequence of nonnegative real numbers with $w_1 > 0$
and $\{v_n : n \ge 0\}$ is a sequence of nonnegative real numbers to be determined.
Using essentially a renewal theoretic approach, Shanbhag obtained a complete solution
to (1.3), which provided a unified approach to a variety of characterizations
of discrete distributions including, in particular, those related to damage models,
strong memoryless property, order statistics, record values, etc.
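
The "if" direction of the Rao-Rubin characterization can be checked numerically: with a Poisson law for $X$ and a binomial survival distribution, the marginal law of $Y$ and the conditional law of $Y$ given $X = Y$ (no damage) should coincide. The sketch below truncates the Poisson law at a large value; the function names, the truncation point and the parameter values are illustrative assumptions.

```python
from math import comb, exp, factorial


def poisson_pmf(lam: float, x: int) -> float:
    """Poisson probability P{X = x} with mean lam."""
    return exp(-lam) * lam**x / factorial(x)


def damage_model_check(lam: float, p: float, x_max: int = 60, y_max: int = 6):
    """Return (P{Y = y}, P{Y = y | X = Y}) for y = 0..y_max under binomial survival."""
    g = [poisson_pmf(lam, x) for x in range(x_max + 1)]

    def survival(y: int, x: int) -> float:
        # Binomial survival distribution S(y | x) with parameters (x, p).
        return comb(x, y) * p**y * (1 - p) ** (x - y)

    marginal = [
        sum(g[x] * survival(y, x) for x in range(y, x_max + 1))
        for y in range(y_max + 1)
    ]
    # Joint probability of {X = y, Y = y}, i.e. no damage at all.
    undamaged = [g[y] * survival(y, y) for y in range(x_max + 1)]
    norm = sum(undamaged)
    conditional = [undamaged[y] / norm for y in range(y_max + 1)]
    return marginal, conditional


marginal, conditional = damage_model_check(lam=2.0, p=0.3)
```

Both lists should agree to numerical precision, and both are the Poisson law with mean $\lambda p$.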

Shanbhag's (1977) general result on damage models states essentially (in the
notation described above) that if $g_0 < 1$ and, with $\{(a_n, b_n) : n = 0, 1, \ldots\}$ as a
sequence of 2-component real vectors such that $a_n > 0$ for all $n$, $b_0, b_1 > 0$, and
$b_n \ge 0$ for all $n \ge 2$, we have, for almost all $x$,

$$S(y|x) \propto a_y b_{x-y}, \quad y = 0, 1, \ldots, x,$$

then the following are equivalent:

(i) (1.2) (i.e. RR(0)) is met;

(ii) $Y$ and $X - Y$ are independent;

(iii) $(g_x/c_x) = (g_0/c_0)\lambda^x$, $x = 0, 1, \ldots$, for some $\lambda > 0$, where $\{c_n\}$ is the convolution
of $\{a_n\}$ and $\{b_n\}$.

Characterizations of many standard discrete distributions in damage model
studies follow as corollaries to this latter result. In particular, taking $a_n = p^n/n!$, $n = 0, 1, \ldots$,
and $b_n = (1 - p)^n/n!$, $n = 0, 1, \ldots$, where $p \in (0, 1)$ and fixed, we get the
Rao-Rubin (1964) theorem as a corollary to this. There are several other interesting
contributions to the literature on damage models. Rao and Shanbhag (1994; Chapter 7)
have reviewed and unified most of these. More recently, Rao et al. (2002) and
Rao et al. (2003) have provided systematic approaches to damage models based
on nonnegative matrices and Markov chains. In particular, Rao et al. (2002) have
shown that several of the findings on damage models in the literature are corollaries
to a potential theoretic result, appearing as Theorem 4.4.1 in Rao and Shanbhag
(1994), on nonnegative matrices; these subsume some of the results in the area based
on the version of de Finetti's theorem for discrete exchangeable random variables.

The purpose of the present paper is to go beyond Rao et al. (2002) and show,
amongst other things, that certain specialized versions of de Finetti's theorem or
the relevant moment arguments provide us with further novel approaches to arrive
at the Rao-Rubin-Shanbhag theorems or their generalizations. In the process of
doing this, we also establish some new results on damage models or otherwise,
including, in particular, an improved version of the crucial result of Alzaid et al.
(1987a).

2. Simple integral equations in damage model studies

The link between the Choquet-Deny type integral equations and exchangeability
or, in particular, certain versions of de Finetti's theorem for an infinite sequence
of exchangeable random variables is well-documented in Rao and Shanbhag (1994)
and other places in the literature. Some specialized versions of de Finetti's theorem
follow via simple arguments involving, among others, moments of probability
distributions, or a potential theoretic result on nonnegative matrices; see, for example,
Feller (1966, pp. 225-226) and Rao et al. (2002). A detailed account of the
literature on de Finetti's theorem is provided by Aldous (1985); see, also, Chow
and Teicher (1979) for an elegant proof of the theorem in the case of real-valued
random variables.

Our main objective in this section though is to verify certain key results on functional
equations with applications to damage models, as corollaries to specialized
versions of de Finetti's theorem; the theorems and corollaries that we have dealt
with in this section are obviously subsumed by the relevant general results obtained
via certain other techniques in Rao and Shanbhag (1994, Chapter 3) and Rao and
Shanbhag (1998).

Theorem 2.1 (Shanbhag's Lemma [32]). Let $\{(v_n, w_n) : n = 0, 1, \ldots\}$ be a
sequence of 2-vectors with nonnegative real components, such that $v_n > 0$ for at
least one $n \ge 0$ and $w_1 > 0$. Then (1.3) is met if and only if, for some $b > 0$,

$$v_n = v_0 b^n, \quad n = 1, 2, \ldots, \quad \text{and} \quad \sum_{n=0}^{\infty} w_n b^n = 1. \tag{2.1}$$

Proof. The "if" part of the assertion is trivial. To prove the "only if" part of the
assertion, let (1.3) be met with the stated assumptions. Since in that case we have
$v_n(1 - w_0) \ge w_1 v_{n+1}$, $n = 0, 1, \ldots$, it is clear that $w_0 < 1$ and $v_0 > 0$. (Note that
Shanbhag (1977) observes via a slightly different argument that $v_n > 0$ for all $n \ge 0$,
but, for us, it is sufficient to have that $v_0 > 0$.) Essentially from (1.3), we have then
that there exists a sequence $\{X_n : n = 1, 2, \ldots\}$ of 0-1-valued exchangeable random
variables satisfying

$$P\{X_1 = \cdots = X_n = 1\} = \frac{v_n}{v_0}w_1^n, \quad n = 1, 2, \ldots. \tag{2.2}$$

(For some relevant information, see Remark 2.6.) From the corresponding specialized
version of de Finetti's theorem, we have hence that $\{\frac{v_n}{v_0}w_1^n : n = 0, 1, \ldots\}$ is
a moment sequence of a (bounded) nonnegative random variable, which, in turn,
implies that $\{\frac{v_n}{v_0} : n = 0, 1, \ldots\}$ is a moment sequence of a (bounded) nonnegative
random variable. Denoting the random variable in the latter case by $Y$ and appealing
to (1.3) in conjunction with the expression for $Z$, we get, in view of Fubini's
theorem or the monotone convergence theorem, that

$$E(Z) = E(Z^2) = 1, \tag{2.3}$$

where

$$Z = \sum_{n=0}^{\infty} w_n Y^n. \tag{2.4}$$

From (2.3), noting, for example, that $E\{(Z - 1)^2\} = 0$, we see that $Z = 1$ a.s.;
consequently, from (2.4) and, in particular, the property that $w_0 < 1$, we get that
there exists a number $b > 0$ such that $Y = b$ a.s. and $\sum_{n=0}^{\infty} w_n b^n = 1$. Since

$$\frac{v_n}{v_0} = E(Y^n), \quad n = 0, 1, \ldots,$$

we then see that the "only if" part of the theorem holds. □
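
The "if" direction of Theorem 2.1 is easy to illustrate numerically: for a finitely supported weight sequence, solve $\sum_n w_n b^n = 1$ for $b$ and confirm that $v_n = v_0 b^n$ satisfies the recurrence (1.3). The sketch below uses bisection; the weights, the bracket $[0, 10]$ and the function name are hypothetical choices, and $v_0 = 1$ is assumed.

```python
def solve_b(w, lo: float = 0.0, hi: float = 10.0, iterations: int = 200) -> float:
    """Bisection for b > 0 with sum_n w[n] * b**n = 1.

    Assumes w is finite and nonnegative with w[0] < 1 and w[1] > 0, so the
    series is strictly increasing in b and crosses 1 inside [lo, hi].
    """
    def series(b: float) -> float:
        return sum(wn * b**n for n, wn in enumerate(w))

    for _ in range(iterations):
        mid = (lo + hi) / 2
        if series(mid) < 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


w = [0.1, 0.5, 0.2]          # hypothetical weights w_0, w_1, w_2
b = solve_b(w)
v = [b**n for n in range(12)]  # v_n = v_0 * b**n with v_0 = 1

# Residuals of the recurrence v_n = sum_m w_m v_{m+n}, for n = 0..7.
residuals = [v[n] - sum(w[m] * v[n + m] for m in range(len(w))) for n in range(8)]
```

For these weights the root solves $0.2b^2 + 0.5b - 0.9 = 0$, so it can also be checked against the quadratic formula.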
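Theorem 2.2. Let $k$ be a positive integer and $N_0 = \{0, 1, 2, \ldots\}$ and $\{(v_{\mathbf{n}}, w_{\mathbf{n}}) : \mathbf{n} \in N_0^k\}$
be a sequence of 2-vectors of nonnegative real components such that $v_{\mathbf{0}} > 0$,
$w_{\mathbf{0}} < 1$ and $w_{\mathbf{n}} > 0$ whenever $\mathbf{n}$ is of unit length. (The notation $\mathbf{0}$ stands for $\mathbf{n}$
with all coordinates equal to zero.) Then

$$v_{\mathbf{n}} = \sum_{\mathbf{m} \in N_0^k} v_{\mathbf{n}+\mathbf{m}}\,w_{\mathbf{m}}, \quad \mathbf{n} \in N_0^k \tag{2.5}$$

if and only if $\{v_{\mathbf{n}}/v_{\mathbf{0}}\}$ is the moment sequence relative to a $k$-component random
vector $(Y_1, \ldots, Y_k)$ with $Y_r$'s as nonnegative and bounded such that (in obvious
notation)

$$\sum_{\mathbf{n} \in N_0^k} w_{\mathbf{n}}\prod_{r=1}^{k} Y_r^{n_r} = 1 \quad \text{a.s.} \tag{2.6}$$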

Proof. It is sufficient, as in the case of Theorem 2.1, to prove the "only if" part of
the assertion. Clearly under the assumptions of the theorem, taking for convenience
$k \ge 2$, the validity of (2.5) implies the existence of a sequence $\{X_m : m = 1, 2, \ldots\}$ of
exchangeable random variables, with values in $\{0, 1, \ldots, k\}$, satisfying (with obvious
interpretation when some or all of the $n_r$'s equal zero)

$$P\{X_1, \ldots, X_{n_1+\cdots+n_k} \text{ are such that the first } n_1 \text{ of these equal } 1, \text{ the next } n_2 \text{ equal } 2, \text{ and so on}\} = \frac{v_{(n_1,\ldots,n_k)}}{v_{\mathbf{0}}}\prod_{r=1}^{k} w_{I(r)}^{n_r}, \quad (n_1, \ldots, n_k) \in N_0^k, \tag{2.7}$$

where $I(r)$ is the $r$th row of the $k \times k$ identity matrix. (For some relevant information,
see Remark 2.6.) Using the appropriate version of de Finetti's theorem and
following a suitably modified version of the relevant part of the argument in the
proof of Theorem 2.1, we see that there exists a random vector $(Y_1, \ldots, Y_k)$ as in
the assertion with $\{v_{\mathbf{n}}/v_{\mathbf{0}}\}$ as the corresponding moment sequence; note especially
that in this latter case (2.3) holds with $Z$ given by the left hand side of (2.6). □

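Corollary 2.1 (Hausdorff). A sequence $\{\mu_{\mathbf{n}} : \mathbf{n} \in N_0^k\}$ of real numbers represents
the moment sequence of some probability distribution concentrated on $[0, 1]^k$ if and
only if $\mu_{\mathbf{0}} = 1$ and

$$(-1)^{m_1+\cdots+m_k}\,\Delta_1^{m_1}\cdots\Delta_k^{m_k}\,\mu_{\mathbf{n}} \ge 0, \quad (\mathbf{m}, \mathbf{n}) \in N_0^{2k}, \tag{2.8}$$

where $\Delta_i$ is the usual difference operator acting on the $i$th coordinate.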

Proof. Define the left hand side of the inequality under (2.8) by $v_{(m_1,\ldots,m_k,n_1,\ldots,n_k)}$.
Then, we can easily verify that

$$v_{(m_1,\ldots,m_k,n_1,\ldots,n_k)} = \frac{1}{k}\left\{v_{(m_1+1,\ldots,m_k,n_1,\ldots,n_k)} + \cdots + v_{(m_1,\ldots,m_k+1,n_1,\ldots,n_k)} + v_{(m_1,\ldots,m_k,n_1+1,\ldots,n_k)} + \cdots + v_{(m_1,\ldots,m_k,n_1,\ldots,n_k+1)}\right\}, \quad (m_1, \ldots, m_k, n_1, \ldots, n_k) \in N_0^{2k}.$$

Because of (2.8), Theorem 2.2 implies then that $\{\mu_{\mathbf{n}} : \mathbf{n} \in N_0^k\}$ (i.e. $\{v_{(\mathbf{0},\mathbf{n})} : \mathbf{n} \in N_0^k\}$)
is the moment sequence relative to a $k$-component random vector $(Y_1, \ldots, Y_k)$ with
$Y_i$'s bounded and nonnegative. In view of (2.8), it follows further that, for each $r$, $\{\mu_{n_r I(r)} : n_r = 0, 1, \ldots\}$
is decreasing and hence, it is obvious that the "if" part of the result
holds. The "only if" part here is trivial and therefore we have the corollary. □

Remark 2.1. Although Theorem 2.1 is a corollary to Theorem 2.2, we have dealt
with it separately because of its importance in characterization theory relative to
univariate discrete distributions. Theorem 2.2, in turn, is a corollary to a result of
Ressel (1985) and also to that of Rao and Shanbhag (1998) established via certain
general versions of de Finetti's theorem, but its proof given by us here could appeal
to the audience due to its simplicity. It may also be worth pointing out in this
place that Chapter 3 of Rao and Shanbhag (1994) reviews and unifies, amongst other
things, martingale approaches to certain generalized versions of Theorem 2.2, implied
earlier; the cited chapter also shows, explicitly or otherwise, using partially a different
route to ours, that the following Corollaries 2.1 and 2.2 are consequences of the
general results.

Remark 2.2. Corollary 2.1 can also be proved directly via de Finetti's theorem
noting that there exists a sequence $\{X_n : n = 1, 2, \ldots\}$ of exchangeable random
variables with values in $\{0, 1, \ldots, k\}$ and satisfying (2.7) with its right hand side
replaced by [...]. Also, since $\{\mu_{\mathbf{n}}\}$ in Corollary 2.1 is the moment sequence
relative to a probability distribution with compact support, it is obvious that it
determines the distribution; in view of this, we can easily obtain the following result
as a further corollary to Theorem 2.2.

Corollary 2.2 (Bochner). Let $f$ be a completely monotonic function on $(0, \infty)^k$.
Then $f$ has the integral representation

$$f(x) = \int_{[0,\infty)^k}\exp\{-\langle x, y\rangle\}\,d\nu(y), \quad x \in (0, \infty)^k, \tag{2.9}$$

with $\nu$ as a uniquely determined measure on $[0, \infty)^k$.

Proof. Given any $x_0 \in (0, \infty)^k$, Corollary 2.1, on taking into account the latter
observation in Remark 2.2 and the continuity of $f$, implies after a minor manipulation
that there exists a probability measure $\mu_{x_0}$ on $[0, \infty)^k$ such that for all $k$-vectors $r$
with positive rational components

$$f(x_0 + r) = f(x_0)\int_{[0,\infty)^k}\exp\{-\langle r, y\rangle\}\,d\mu_{x_0}(y). \tag{2.10}$$

Since $f(x_0 + \cdot)$ is continuous on $[0, \infty)^k$, (2.10) implies because of the dominated
convergence theorem that

$$f(x_0 + x) = f(x_0)\int_{[0,\infty)^k}\exp\{-\langle x, y\rangle\}\,d\mu_{x_0}(y), \quad x \in [0, \infty)^k.$$

In view of the arbitrary nature of $x_0$ and the uniqueness theorem for Laplace-Stieltjes
transforms, we have (2.9) to be valid with $\nu$ as unique and such that,
irrespectively of what $x_0$ is,

$$d\nu(y) = f(x_0)\exp\{\langle x_0, y\rangle\}\,d\mu_{x_0}(y), \quad y \in [0, \infty)^k.$$

Hence, we have the Corollary. □
Remark 2.3. Bernstein's theorem for completely monotonic or absolutely monotonic
functions is indeed a corollary to Corollary 2.2. Rao and Rubin (1964) have used
this theorem to arrive at a characterization of Poisson distributions based on a
damage model. There are also further applications of the theorem to damage models;
see, for example, the next section of the present paper. Talwalker (1970) has
given an extended version of the Rao-Rubin result via Corollary 2.2, while Puri and
Rubin (1974) have given representations of relevance to reliability essentially via
Corollaries 2.2 and 2.1, respectively; for certain observations on these latter results,
see, for example, Shanbhag (1974) and Davies and Shanbhag (1987).

The following theorem of Rao and Shanbhag (1994, p. 167), which is an extended
version of the results of Rao and Rubin (1964) and Talwalker (1970) referred to in
Remark 2.3 above as well as of the relevant result in Shanbhag (1977), is indeed
a corollary to Theorem 2.2; this obviously tells us that Theorem 7.2.6 of Rao and
Shanbhag (1994) is also subsumed by Theorem 2.2.

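Theorem 2.3. Let $(\mathbf{X}, \mathbf{Y})$ be a random vector such that $\mathbf{X}$ and $\mathbf{Y}$ are $k$-component
vectors satisfying

$$P\{\mathbf{X} = \mathbf{n}, \mathbf{Y} = \mathbf{r}\} = g_{\mathbf{n}}\,S(\mathbf{r}|\mathbf{n}), \quad \mathbf{r} \in [\mathbf{0}, \mathbf{n}] \cap N_0^k,\ \mathbf{n} \in N_0^k,$$

with $\{g_{\mathbf{n}} : \mathbf{n} \in N_0^k\}$ as a probability distribution and, for each $\mathbf{n}$ for which $g_{\mathbf{n}} > 0$,

$$S(\mathbf{r}|\mathbf{n}) = \frac{a_{\mathbf{r}}\,b_{\mathbf{n}-\mathbf{r}}}{c_{\mathbf{n}}}, \quad \mathbf{r} \in [\mathbf{0}, \mathbf{n}] \cap N_0^k,\ \mathbf{n} \in N_0^k,$$

where $\{a_{\mathbf{n}} : \mathbf{n} \in N_0^k\}$ and $\{b_{\mathbf{n}} : \mathbf{n} \in N_0^k\}$ are respectively positive and nonnegative
real sequences with $b_{\mathbf{0}} > 0$ and $b_{\mathbf{n}} > 0$ if $\mathbf{n}$ is of unit length, and $\{c_{\mathbf{n}} : \mathbf{n} \in N_0^k\}$ is
the convolution of these two sequences. Then

$$P\{\mathbf{Y} = \mathbf{r}\} = P\{\mathbf{Y} = \mathbf{r} \mid \mathbf{X} = \mathbf{Y}\}, \quad \mathbf{r} \in N_0^k, \tag{2.11}$$

if and only if (in obvious notation)

$$\frac{g_{\mathbf{n}}}{c_{\mathbf{n}}} = \int_{[0,\infty)^k}\prod_{i=1}^{k}\lambda_i^{n_i}\,d\nu(\boldsymbol{\lambda}), \quad \mathbf{n} \in N_0^k, \tag{2.12}$$

with ($0^0 = 1$ and) $\nu$ as a finite measure on $[0, \infty)^k$ such that it is concentrated for
some $\beta > 0$ on $\{\boldsymbol{\lambda} : \sum_{\mathbf{n} \in N_0^k} b_{\mathbf{n}}\prod_{i=1}^{k}\lambda_i^{n_i} = \beta\}$.

The above theorem follows on noting especially that (2.11) is equivalent to

$$\frac{g_{\mathbf{n}}}{c_{\mathbf{n}}} \propto \sum_{\mathbf{m} \in N_0^k} b_{\mathbf{m}}\,(g_{\mathbf{m}+\mathbf{n}}/c_{\mathbf{m}+\mathbf{n}}), \quad \mathbf{n} \in N_0^k.$$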

To provide a further generalization of the Rao-Rubin-Shanbhag theorems, consider
$S$ to be a countable Abelian semigroup with zero element, equipped with
discrete topology, and $S^* \subset S$ such that given $w : S \to [0, \infty)$ with $\mathrm{supp}(w)\,(= \{x : w(x) > 0\}) = S^*$,
any function $v : S \to [0, \infty)$ with $v(0) > 0$ cannot be a solution
to

$$v(x) = \sum_{y \in S} v(x + y)\,w(y), \quad x \in S \tag{2.13}$$

unless it has an integral representation in terms of $w$-harmonic exponential functions,
with respect to a probability measure. (By a $w$-harmonic exponential function
here, we mean a function $e : S \to [0, \infty)$ such that $e(x + y) = e(x)e(y)$, $x, y \in S$,
and $\sum_{x \in S} e(x)w(x) = 1$.) Examples of such $S$, $S^*$ have been dealt with by Rao and
Shanbhag (1998) and studied implicitly or otherwise by Rao and Shanbhag (1994).
Suppose now that $a : S \to (0, \infty)$ and $b : S \to [0, \infty)$ are such that $b(0) > 0$ and
there exists $c : S \to (0, \infty)$ as the convolution of $a$ and $b$, and $Y$ and $Z$ are random
elements defined on a probability space, with values in $S$, such that

$$P\{Y = y, Z = z\} = g(y + z)\,\frac{a(y)\,b(z)}{c(y + z)}, \quad y, z \in S,$$

where $\{g(x) : x \in S\}$ is a probability distribution. If $\mathrm{supp}(b) = S^*$, then it easily
follows that

$$P\{Y = y\} = P\{Y = y \mid Z = 0\}, \quad y \in S,$$

if and only if $g(x)/c(x)$, $x \in S$, is of the form of a constant multiple of the solution
$v$ to (2.13) with, for some $\gamma > 0$, $w$ replaced by $\gamma b$; this latter result is clearly an
extended version of Theorem 2.3.

Remark 2.4. In view of Rao et al. (2002), the link between the general result
relative to a countable semigroup that we have met above and Theorem 4.4.1 of
Rao and Shanbhag (1994) or its specialized version appearing in Williams (1979)
is obvious. The arguments in Rao and Shanbhag (1994) for solving general integral
equations on semigroups, including those involving martingales, obviously simplify
considerably if the semigroups are countable; we shall throw further light on these
issues through a separate article.

Remark 2.5. Modifying the proof of Theorem 2.1 slightly, involving in particular
a further moment argument, a proof based on the version of de Finetti's theorem
relative to 0-1-valued exchangeable random variables can be produced for Corollary 2.2.3
appearing on page 31 in Rao and Shanbhag (1994). (Note that the version
of (1.3) in this case implies that there exists a nonnegative bounded random variable
$Y$ such that $E(Y^{mn}) = v_{mn}/v_0$, $n = 0, 1, \ldots$, for each $m$ with $w_m > 0$.) This
latter result is indeed a corollary to the Lau-Rao theorem ([13], [20]), and, in turn,
is essentially a generalization of Shanbhag's lemma. As pointed out by Rao and
Shanbhag (2004), in view of Alzaid et al. (1987b), there exists a proof for the Lau-Rao
theorem based, among other things, on the version of de Finetti's theorem
just referred to; there also exist possibilities of solving integral equations via this
or other versions of de Finetti's theorem, elsewhere.

Remark 2.6. Suppose $S$ is a countable Abelian semigroup with zero element,
equipped with discrete topology, and $v$ and $w$ are nonnegative real-valued functions
on $S$ such that $v(0) > 0$, $w(0) < 1$, and (2.13) is met. Then there exists an infinite
sequence $\{X'_n : n = 1, 2, \ldots\}$ of exchangeable random elements with values in $S$ for
which for each positive integer $n$ and $x'_1, \ldots, x'_n \in S$,

$$P\{X'_1 = x'_1, X'_2 = x'_2, \ldots, X'_n = x'_n\} = \frac{v(x'_1 + \cdots + x'_n)}{v(0)}\prod_{i=1}^{n} w(x'_i). \tag{2.14}$$

If $s_i$, $i = 1, \ldots, k$ (with $k \ge 1$), are distinct nonzero members of $S$ such that
$w(s_i) > 0$, $i = 1, \ldots, k$, taking for example, $X_n$, $n = 1, 2, \ldots$, such that

$$X_n = \begin{cases} i & \text{if } X'_n = s_i,\ i = 1, \ldots, k, \\ 0 & \text{otherwise}, \end{cases}$$

we can now see that there exists a sequence $\{X_n : n = 1, 2, \ldots\}$ of exchangeable
random variables with values in $\{0, 1, \ldots, k\}$ for which (2.7) (when its left hand side
is read as that of (2.2) with $n_1$ in place of $n$ if $k = 1$) is valid, provided its right hand
side is now replaced by $\frac{v(n_1 s_1 + \cdots + n_k s_k)}{v(0)}\prod_{i=1}^{k}(w(s_i))^{n_i}$. Consequently, in view of the
relevant version of de Finetti's theorem, it follows that even when $s_i$, $i = 1, \ldots, k$,
are not taken to be distinct or nonzero, provided $w(s_i) > 0$, $i = 1, \ldots, k$, we
have $\left\{\frac{v(n_1 s_1 + \cdots + n_k s_k)}{v(0)} : n_i = 0, 1, \ldots,\ i = 1, \ldots, k\right\}$ to be the moment sequence of a
probability distribution on $\mathbb{R}^k$ with support as a compact subset of $[0, \infty)^k$.


3. Spitzer’s integral representation theorem and relevant observations 

This section is devoted mainly to illustrate as to how Bernstein's theorem on absolutely
monotonic functions, referred to in Remark 2.3, in conjunction with Yaglom's
theorem mentioned on page 18 in Athreya and Ney (1972), leads us to an
improved version of the key result of Alzaid et al. (1987a) and certain of its corollaries.

Suppose $\{Z_n : n = 0, 1, \ldots\}$ is a homogeneous Markov chain with state space
$\{0, 1, \ldots\}$, such that the corresponding one-step transition probabilities are given
by

$$p_{ij} = P\{Z_{n+1} = j \mid Z_n = i\} = \begin{cases} c\,p_j^{(i)}, & i = 0, 1, \ldots;\ j = 1, 2, \ldots, \\ 1 - c + c\,p_0^{(i)}, & i = 0, 1, \ldots;\ j = 0, \end{cases}$$

where $c \in (0, 1]$ and $\{p_j^{(i)} : j = 0, 1, \ldots\}$ is the $i$-fold convolution of some probability
distribution $\{p_j\}$ for which $p_0 \in (0, 1)$, for $i = 1, 2, \ldots$, and the degenerate
distribution at zero if $i = 0$. Clearly, this is an extended version of a Bienaymé-Galton-Watson
branching process; indeed, we can view the latter as a special case
of the former with $c = 1$.
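
The transition law just described can be sketched directly for an offspring distribution with finite support. The code below builds row $i$ of the transition matrix from the $i$-fold convolution; the offspring probabilities, the value of $c$ and the function names are hypothetical, and no truncation is needed because the support of each row is finite in this example.

```python
def convolve(a, b):
    """Convolution of two finitely supported probability vectors."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def transition_row(i: int, offspring, c: float):
    """Row i of the transition matrix: p_ij = c p_j^(i) for j >= 1, p_i0 = 1 - c + c p_0^(i)."""
    dist = [1.0]  # 0-fold convolution: the degenerate distribution at zero
    for _ in range(i):
        dist = convolve(dist, offspring)
    row = [c * q for q in dist]
    row[0] += 1 - c
    return row


offspring = [0.5, 0.3, 0.2]  # hypothetical {p_j} with p_0 in (0, 1) and mean 0.7 < 1
c = 0.7
rows = [transition_row(i, offspring, c) for i in range(4)]
```

Setting `c = 1.0` gives the rows of an ordinary Bienaymé-Galton-Watson process with the same offspring distribution.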

Under the condition that $m = \sum_{j=1}^{\infty} j p_j < 1$ with $m^* = \sum_{j=1}^{\infty}(j\log j)p_j < \infty$,
Alzaid et al. (1987a) have given an integral representation for stationary measures
of the general process referred to above. A specialized version of this representation
in the case of $c = 1$ was essentially established earlier by Spitzer (1967); this
latter result appears also as Theorem 3 in Section 2 of Chapter II of Athreya and
Ney (1972). The general representation theorem as well as its specialized version
follow via Martin boundary related approaches or their alternatives involving specific
tools such as Bernstein's theorem on absolutely monotonic functions; see, for
example, Alzaid et al. (1987a) and Rao et al. (2002) for some relevant arguments
or observations in this connection.

From a minute scrutiny of the proof provided by Alzaid et al. (1987a) for the
general representation theorem, i.e. Theorem 2 in the cited reference, it has now
emerged that the theorem referred to holds even when the constraint that $m^* < \infty$
is dropped. Indeed, Yaglom’s theorem mentioned on page 18 in Athreya and Ney
(1972) implies (in obvious notation) that if m < 1, then, irrespective of whether
or not $m^* < \infty$, the sequence of generating functions in question converges
pointwise to B; essentially, the argument on page 1212 in Alzaid et al. (1987a) to
show that a certain function, U*, is the generating function of a nonnegative
sequence then remains valid and gives us specifically the sequence to be that
corresponding to a stationary measure of the process with $p_0 = 1 - m$ and $p_1 = m$,
without requiring that $m^* < \infty$. (One can also, obviously, give the argument
implied here in terms of $f_n$, the nth iterates of f, directly without involving $Q_n$;
note that we use, as usual, the notation f for the generating function of $\{p_j\}$.)

The original form of Spitzer’s theorem, involving, amongst other things, the
parameter Q(0), requires the assumption of $m^* < \infty$. [Note that
$f_n(s) = B^{-1}(1 - m^n + m^n B(s))$ and hence $Q_n(0) = (B^{-1}(1 - m^n) - 1)/m^n$ has a
nonzero limit Q(0) as $n \to \infty$ only if $B'(1) < \infty$ and hence only if $m^* < \infty$; see
the proof of the theorem on page 70, in conjunction with the remark on page 18, in
Athreya and Ney (1972).] However, from what we have observed above, it is clear
that this latter theorem holds even when the assumption mentioned is deleted,
provided “−1” is taken in place of “Q(0)” in the statement of the theorem.

As a by-product of the revelation that we have made above, it follows that
if m < 1, $U(\cdot)$ is the generating function of a stationary measure of the process
if and only if it is of the form $U^*(B(\cdot))$ with U* as the generating function of a
stationary measure in the special case where $p_0 = 1 - m$, $p_1 = m$. This is obviously
a consequence of Yaglom’s theorem, in light of the extended continuity theorem
of Feller (1966, page 433). The example given by Harris, appearing on page 72 of 
Athreya and Ney (1972), to prove the existence of stationary measures does not 
require m* < oo and is of the form that we have met here; clearly it is not covered 
by Spitzer’s original representation theorem. As implied in Alzaid et al. (1987a), a 
representation for U* itself in our general case follows essentially as a consequence 
of Bernstein’s theorem on absolutely monotonic functions or the Poisson-Martin 
integral representation theorem for a stationary measure: see, also, Rao et al. (2002) 
for some relevant observations. 

Taking into account our observations, it is hence seen that the following modified 
version of the main result of Alzaid et al. (1987a) holds. 

Theorem 3.1. If m < 1, then a sequence $\{\eta_j : j = 1, 2, \ldots\}$ is a stationary
measure if and only if, for some non-null finite measure $\nu$ on [0, 1),



where, for each k, $\{b_j^{(k)} : j = 1, 2, \ldots\}$ {uhth = 0) denotes the distribution
relative to the probability generating function $(B(\cdot))^k$ with $B(\cdot)$ as implied earlier
(to be a unique probability generating function satisfying B(0) = 0 and
$B(f(s)) = 1 - m + mB(s)$, $s \in [-1, 1]$). Moreover, if (3.1) is met with m < 1, then
$\{\eta_j\}$ is a stationary measure satisfying $\sum_j \eta_j p_0^j = 1$, i.e. with generating
function U such that $U(p_0) = 1$, if and only if, for some probability measure $\mu$ on [0, 1),

$$d\nu(t) = K\, d\mu(t), \quad t \in [0, 1), \qquad (3.2)$$

with K such that 

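$$K^{-1} =
\begin{cases}
1 & \text{if } c = 1, \\
\sum_{n=-\infty}^{\infty} c^n \int_{[0,1)} \exp\{-m^{n-t}\}\, d\mu(t) & \text{if } c \in (0, 1).
\end{cases}$$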

The following theorem is of relevance to the topic of damage models especially
in view of the results on damage models appearing in Talwalker (1980), Rao et al.
(1980) and Alzaid et al. (1987a); this theorem is indeed a variation of Theorem 1
of Alzaid et al. (1987a).

Theorem 3.2. Let $c \in (0, 1)$ and $\{(v_n, h_n) : n = 0, 1, \ldots\}$ be a sequence of
2-vectors with nonnegative real components such that at least one $v_n$ is nonzero and
$h_0$ is nonzero and $h_1 < 1$. Then

$$c \sum_{k=0}^{\infty} v_k h_j^{(k)} = v_j, \quad j = 0, 1, \ldots \qquad (3.3)$$

where, for each k > 0, $\{h_j^{(k)}\}$ is the k-fold convolution of $\{h_j\}$, and $\{h_j^{(0)}\}$ is the
probability distribution that is degenerate at zero, if and only if, for some $s_0 > 0$,

$$p_j = h_j s_0^{j-1}, \quad j = 0, 1, \ldots \qquad (3.4)$$

is a nondegenerate probability distribution, $\{v_j s_0^j : j = 1, 2, \ldots\}$ is a stationary
measure (not necessarily normalized as in Alzaid et al. (1987a)) relative to the
general branching process with $\{p_j\}$ as in (3.4), and
$v_0 = c(1 - c)^{-1} \sum_{k=1}^{\infty} v_k h_0^{(k)}$.

Theorem 3.2 is easy to establish.

Remark 3.1. If $\{h_n\}$ of Theorem 3.2 satisfies a further condition that $h_n = 0$ for
$n \geq 2$, then the assertion of the theorem holds with $s_0 = h_0/(1 - h_1)$ and the
stationary measure in it satisfying (3.1) with $b_1 = 1$ and $m = h_1$. Additionally, if we
are given a priori that $\{v_j\}$ is of the form

$$v_j = g_j \alpha^j, \quad j = 0, 1, \ldots$$

with $\{g_j\}$ as a probability distribution and $\alpha > 0$, then it is clear that (3.3) holds
if and only if

with $\mu$ as a probability measure on [0, 1). As an immediate consequence of the latter
result, Theorem 3 of Alzaid et al. (1987a) now follows.

Remark 3.2. One can extend the main result of Alzaid et al. (1986) based on the
Perron-Frobenius theorem in an obvious way involving (in usual notation)

$$P\{Y = r\} = P\{Y' = r \mid X' - Y' = k_0\}
= P\{Y'' = r \mid X'' - Y'' = k_0 + k_1\}, \quad r = 0, 1, \ldots$$

with $k_0 > 0$ and $k_1 > 0$, such that the survival distributions corresponding to
(X, Y), (X', Y') and (X'', Y'') are not necessarily the same but
$X \stackrel{d}{=} X' \stackrel{d}{=} X''$.
This provides us with further insight into Theorem 3 of Alzaid et al. (1987a). (For
an account of the Perron-Frobenius theorem with applications to Markov chains,
see Seneta (1981).)

Remark 3.3. Most of the results dealt with in this article also follow via alternative 
arguments based on Choquet’s theorem; for the details of this theorem, see Phelps
(1966).

Remark 3.4. If we agree to rewrite the notation {/* as to take into account 
the value of the parameter c of the process, it easily follows (in obvious notation) 
that .given c < 1 and there exists an such that 

^t/(-,/s)oc (^^t/(;)(s)^/(l-s)''->/<‘"’''), S6(-1,1). (3.5)
However, it is worth noting here that there exist cases of (such as those with

f/(*j)(s) = (ln(l -s))/(1h{t77)), s ∈ (-1, 1)) for which (3.5) with c ∈ (0, 1) is not met.

Abstract: We present a history of the development of the theory of Stochastic
Integration, starting from its roots with Brownian motion, up to the introduction
of semimartingales and the independence of the theory from an underlying
Markov process framework. We show how the development has influenced and
in turn been influenced by the development of Mathematical Finance Theory.

The calendar period is from 1880 to 1970.


The history of stochastic integration and the modelling of risky asset prices both
begin with Brownian motion, so let us begin there too. The earliest attempts to
model Brownian motion mathematically can be traced to three sources, each of
which knew nothing about the others: the first was that of T. N. Thiele of
Copenhagen, who effectively created a model of Brownian motion while studying time
series in 1880 [81];² the second was that of L. Bachelier of Paris, who created a
model of Brownian motion while deriving the dynamic behavior of the Paris stock
market, in 1900 (see [1, 2, 11]); and the third was that of A. Einstein, who proposed
a model of the motion of small particles suspended in a liquid, in an attempt to
convince other physicists of the molecular nature of matter, in 1905 [21]. (See [64] for
a discussion of Einstein’s model and his motivations.) Of these three models, those
of Thiele and Bachelier had little impact for a long time, while that of Einstein was
immediately influential.

We go into a little detail about what happened to Bachelier, since he is now 
seen by many as the founder of modern Mathematical Finance. Ignorant of the 
work of Thiele (which was little appreciated in its day) and preceding the work
of Einstein, Bachelier attempted to model the market noise of the Paris Bourse.
Exploiting the ideas of the Central Limit Theorem, and realizing that market noise
should be without memory, he reasoned that increments of stock prices should be
independent and normally distributed. He combined his reasoning with the Markov
property and semigroups, and connected Brownian motion with the heat equation,
using that the Gaussian kernel is the fundamental solution to the heat equation.
He was able to define other processes related to Brownian motion, such as the
maximum change during a time interval (for one dimensional Brownian motion),
by using random walks and letting the time steps go to zero, and by then taking

² This was called to our attention by Ragnar Norberg, whom we thank, and the contributions
of Thiele are detailed in a paper of Hald [30].

limits. His thesis was appreciated by his mentor H. Poincaré, but partially due to
the distaste of studying economics as an application of mathematics, he was unable
to join the Paris elite, and he spent his career far off in the provincial capital of
Besançon, near Switzerland in Eastern France. (More details of this sad story are
provided in [11].)

Let us now turn to Einstein’s model. In modern terms, Einstein assumed that 
Brownian motion was a stochastic process with continuous paths, independent
increments, and stationary Gaussian increments. He did not assume other reasonable
properties (from the standpoint of physics), such as rectifiable paths. If he had
assumed this last property, we now know his model would not have existed as a
process. However, Einstein was unable to show that the process he proposed actu- 
ally did exist as a mathematical object. This is understandable, since it was 1905, 
and the ideas of Borel and Lebesgue constructing measure theory were developed 
only during the first decade of the twentieth century. 

In 1913 Daniell’s approach to measure theory (in which integrals are defined 
before measures) appeared, and it was these ideas, combined with Fourier series, 
that N. Wiener used in 1923 to construct Brownian motion, justifying after the fact 
Einstein’s approach. Indeed, Wiener used the ideas of measure theory to construct 
a measure on the path space of continuous functions, giving the canonical path
projection process the distribution of what we now know as Brownian motion. Wiener
and others proved many properties of the paths of Brownian motion, an activity
that continues to this day. Two key properties relating to stochastic integration are
that (1) the paths of Brownian motion have a non zero finite quadratic variation,
such that on an interval (s, t), the quadratic variation is (t − s) and (2) the paths of
Brownian motion have infinite variation on compact time intervals, almost surely.
The second property follows easily from the first. Note that if Einstein were to have
assumed rectifiable paths, Wiener’s construction would have essentially proved the
impossibility of such a model. In recognition of his work, his construction of
Brownian motion is often referred to as the Wiener process. Wiener also constructed a
multiple integral, but it was not what is known today as the “Multiple Wiener
Integral”: indeed, it was K. Ito, in 1951, when trying to understand Wiener’s papers
(not an easy task), who refined and greatly improved Wiener’s ideas [36].
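
The two path properties quoted above (finite quadratic variation equal to t − s, and infinite variation) can be seen numerically. The following Python sketch is an illustration only: it uses independent Gaussian increments on a uniform grid, the two grids are simulated independently rather than as refinements of one path, and the seed, interval and grid sizes are arbitrary assumptions.

```python
import math
import random


def brownian_increments(n_steps, t_start, t_end, rng):
    """Independent N(0, dt) increments of Brownian motion on (t_start, t_end)."""
    dt = (t_end - t_start) / n_steps
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]


def quadratic_variation(increments):
    """Sum of squared increments along the partition."""
    return sum(dw * dw for dw in increments)


def total_variation(increments):
    """Sum of absolute increments along the partition."""
    return sum(abs(dw) for dw in increments)


rng = random.Random(0)  # fixed seed, chosen arbitrarily
s, t = 0.0, 2.0
coarse = brownian_increments(1_000, s, t, rng)
fine = brownian_increments(100_000, s, t, rng)

qv_fine = quadratic_variation(fine)   # close to t - s
tv_coarse = total_variation(coarse)   # of order sqrt(number of steps)
tv_fine = total_variation(fine)       # keeps growing as the mesh shrinks
```

On the finer grid the sum of squared increments settles near t − s, while the sum of absolute increments grows roughly like the square root of the number of steps, which is the discrete shadow of infinite variation.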

The next step in the groundwork for stochastic integration lay with A. N. Kol- 
mogorov. The beginnings of the theory of stochastic integration, from the non- 
finance perspective, were motivated and intertwined with the theory of Markov 
processes, in which Kolmogorov, of course, played a fundamental role. Indeed, in 
1931, two years before his famous book establishing a rigorous mathematical basis 
for Probability Theory using measure theory, Kolmogorov refers to and briefly
explains Bachelier’s construction of Brownian motion ([41], pp. 64, 102-103). It is this
paper too in which he develops a large part of his theory of Markov processes. Most 
significantly, in this paper Kolmogorov showed that continuous Markov processes 
(diffusions) depend essentially on only two parameters: one for the speed of the drift 
and the other for the size of the purely random part (the diffusive component). He 
was then able to relate the probability distributions of the process to the solu- 
tions of partial differential equations, which he solved, and which are now known 
as “Kolmogorov’s equations.” Of course, Kolmogorov did not have the Ito integral
available, and thus he relied on an analysis of the semigroup and its infinitesimal
generator, and the resulting partial differential equations.³


³ J. L. Doob [17] has complained that the PDE methods of Kolmogorov and Feller used to study
Markov processes have often been called “analytic”, whereas the method of stochastic differentials

After Kolmogorov we turn to the fascinating and tragic story of Vincent Doeblin
(born Wolfgang Döblin), the son of the author Alfred Döblin, who wrote Berlin
Alexanderplatz for example. The Döblin family fled the Nazis from Germany, first
to Switzerland, and then to Paris. Wolfgang changed his name to Vincent Doeblin,
and became a French citizen, finishing his schooling there and being quickly
recognized as an extraordinary mathematical talent. In the late 1920’s Probability
Theory was becoming stylish among mathematicians, especially in the two centers,
Moscow and Paris. Doeblin joined the probabilists, working on Markov chains and
later Markov processes.⁴ Doeblin wanted to construct a stochastic process with
continuous paths that would be consistent with Kolmogorov’s analytic theory of
transition probabilities for Markov processes. He ultimately developed a framework
to study them which was prescient in regards to future developments. However,
Doeblin was drafted, and he volunteered to go to the front. Before he went he sketched
out his ideas and he put this work in the safe of the National Academy of Science of
France, to be opened only by him or else after 100 years. As the Maginot line fell,
to avoid sharing his ideas with the Nazis Doeblin first burned his notes, and then
he took his own life. The academy safe was opened only in May 2000, at the request
of his brother, Claude Doeblin. It was only then that the far reaching vision of his
work became apparent. In those notes, he utilized the new concept of martingales
proposed by J. Ville only in 1939 [84] and understood the importance of studying
sample paths, instead of relying exclusively on distributional properties. One idea
he had was to run Brownian motion by a random clock: what is known today as a
time change. The change of time was then related to the diffusion coefficient, and
in this way he was able to give a modern treatment of diffusions decades before it
was developed otherwise.

We turn now to Kiyosi Ito, the father of stochastic integration. We will not 
attempt to reproduce the beautiful summary of his work and contributions pro- 
vided in 1987 by S. R. S. Varadhan and D. W. Stroock [83], but instead give a 
short synopsis of what we think were key moments.⁶ No doubt an attempt to
establish a true stochastic differential to be used in the study of Markov processes
was one of Ito’s primary motivations for studying stochastic integrals, just as it
was Döblin’s before him, although of course Döblin’s work was secret, hidden away
in the safe of the French Academy of Science. Wiener’s integral did not permit 
stochastic processes as integrands, and such integrands would of course be needed 


introduced by Ito has in contrast been called “probabilistic”. Indeed, he writes, “It is considered
by some mathematicians that if one deals with analytic properties and expectations then the
subject is part of analysis, but that if one deals with sample sequences and sample functions then
the subject is probability but not analysis”. Doob then goes on to make his point convincingly
that both methods are probability. (Doob’s criticism is likely to have been partially inspired by
comments of the second author.) Nevertheless, we contend that the methods of Ito changed the
probabilistic intuition one develops when studying Markov processes.

⁴ J. Doob references his fundamental work on Markov chains and Markov processes extensively
in his book [17], for example. Paul Levy wrote of him in an article devoted to an appreciation of
his work after his death: “Je crois pouvoir dire, pour donner une idée du niveau où il convient de le
situer, qu’on peut compter sur les doigts d’une seule main les mathématiciens qui, depuis Abel et
Galois, sont morts si jeunes en laissant une oeuvre aussi importante”. Translated: “I can say, to give
an idea of Doeblin’s stature, that one can count on the fingers of one hand the mathematicians
who, since Abel and Galois, have died so young and left behind a body of work so important.”
See [44].

⁵ The second author is grateful to Marc Yor for having sent to him his beautiful article, written
together with Bernard Bru [6]. This article, together with the companion (and much more detailed)
article [7], are the sources for this discussion of Doeblin. In addition, the story of Doeblin has
recently been turned into a book in biographical form [65].

⁶ The interested reader can also consult [66].




R. Jarrow and P. Protter 


if one were to represent (for example) a diffusion as a solution of a stochastic differ- 
ential equation. Indeed, Ito has explained this motivation himself, and we let him 
express it: “In these papers⁷ I saw a powerful analytic method to study the
transition probabilities of the process, namely Kolmogorov’s parabolic equation and its
extension by Feller. But I wanted to study the paths of Markov processes in the
same way as Levy observed differential processes. Observing the intuitive
background in which Kolmogorov derived his equation (explained in the introduction
of the paper), I noticed that a Markovian particle would perform a time
homogeneous differential process for infinitesimal future at every instant, and arrived at
the notion of a stochastic differential equation governing the paths of a Markov
process that could be formulated in terms of the differentials of a single differential
process” [37].⁸

Ito’s first paper on stochastic integration was published in 1944 ([34]), the same 
year that Kakutani published two brief notes connecting Brownian motion and 
harmonic functions. Meanwhile throughout the 1940’s Doob, who came to
probability from complex analysis, saw the connection between J. Ville’s martingales and
harmonic functions, and he worked to develop a martingale based probabilistic
potential theory. In addition, H. Cartan greatly advanced potential theory in the mid
1940’s, later followed by Deny’s classic work in 1950. All these ideas swirling around
were interrelated, and in the 1940s Doob clearly explained, for the first time, what
should be the strong Markov property. A few years later (in 1948) E. Hille and
K. Yosida independently gave the structure of semigroups of strongly continuous
operators, clarifying the role of infinitesimal generators in Markov process theory. 

In his efforts to model Markov processes, Ito constructed a stochastic differential 
equation of the form: 

$$dX_t = \sigma(X_t)\,dW_t + \mu(X_t)\,dt,$$

where of course W represents a standard Wiener process. He now had two
problems: one was to make sense of the stochastic differential $\sigma(X_t)\,dW_t$, which he
accomplished in the aforementioned article [34].⁹ The second problem was to connect
Kolmogorov’s work on Markov processes with his interpretation. In particular, he
wanted to relate the paths of X to the transition function of the diffusion. This
amounted to showing that the distribution of X solves Kolmogorov’s forward
equation. This effort resulted in his spectacular paper [35] in 1951, where he stated and
proved what is now known as Ito’s formula:

$$f(X_t) = f(X_0) + \int_0^t f'(X_s)\,dX_s + \frac{1}{2}\int_0^t f''(X_s)\,d[X, X]_s.$$

Here the function f is of course assumed to be $C^2$, and we are using modern
notation.¹⁰ Ito’s formula is of course an extension of the change of variables formula for
⁷ Here Ito is referring to the papers of Kolmogorov [41] and of Feller [26].

⁸ Note that while Ito never mentions the work of Bachelier in his foreword, citing instead
Kolmogorov, Levy, and Doob as his main influences, it is reasonable to think he was aware of
the work of Bachelier, since it is referenced and explained in the key paper of Kolmogorov ([41])
that he lists as one of his main inspirations. While we have found no direct evidence that Ito
ever read Bachelier’s work, nevertheless Hans Follmer and Robert Merton have told the authors
in private communications that Ito had indeed been influenced by the work of Bachelier. Merton
has also published this observation; see page 47 of [51].

⁹ Here Ito cites the work of S. Bernstein [5] as well as that of Kolmogorov [41] and W. Feller
[26] as antecedents for his work.

¹⁰ The book by H. P. McKean, Jr., published in 1969 [47], had a great influence in popularizing
the Ito integral, as it was the first explanation of Ito’s and others’ related work in book form.
But McKean referred to Ito’s formula as Ito’s lemma, a nomenclature that has persisted in some

Riemann-Stieltjes integration, and it reveals the difference between the Ito stochastic
calculus and that of the classical path by path calculus available for continuous
stochastic processes with paths of bounded variation on compact time sets. That
formula is, of course, where A denotes such a process and f is $C^1$,

$$df(A_t) = f'(A_t)\,dA_t.$$
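
The contrast between Ito's formula and the classical chain rule can be checked on a simulated path. The Python sketch below is a minimal illustration under stated assumptions: it takes X = W (a Brownian motion) and f(x) = x², approximates the stochastic integral by left-endpoint sums on a uniform grid, and uses an arbitrary seed and grid size. For this choice the discrete identity W_T² = 2 Σ W_{t_i} ΔW_i + Σ (ΔW_i)² holds exactly, and the last sum approximates the bracket [W, W]_T = T.

```python
import math
import random


def simulate_path(n_steps, horizon, rng):
    """Brownian path sampled on a uniform grid, starting at 0."""
    dt = horizon / n_steps
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path


def ito_sum(path, integrand):
    """Left-endpoint (non-anticipating) sum approximating the integral of integrand(W) dW."""
    return sum(integrand(path[i]) * (path[i + 1] - path[i])
               for i in range(len(path) - 1))


rng = random.Random(1)  # arbitrary seed
T = 1.0
W = simulate_path(50_000, T, rng)

lhs = W[-1] ** 2                                  # f(W_T) - f(W_0) with f(x) = x^2
stochastic_term = ito_sum(W, lambda x: 2.0 * x)   # approximates the integral of f'(W) dW
bracket = sum((W[i + 1] - W[i]) ** 2 for i in range(len(W) - 1))  # approximates [W, W]_T
ito_rhs = stochastic_term + 0.5 * 2.0 * bracket   # Ito's formula with f'' = 2
classical_rhs = stochastic_term                   # what the chain rule alone would give
```

The Ito right-hand side matches f(W_T) up to rounding, whereas the classical right-hand side misses it by the bracket term, which is close to T rather than to zero.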

It can be shown that if one wants to define a path by path integral of the form
$\int H_s\,dA_s$ as the limit of sums, where H is any process with continuous sample paths,
then as a consequence of the Banach-Steinhaus theorem A a fortiori has sample
paths of bounded variation on compacts. (See, for example, [67].) Since Brownian
motion has paths of unbounded variation almost surely on any finite time interval,
Ito knew that it was not possible to integrate all continuous stochastic processes.
One of his key insights was to limit his space of integrands to those that were, as
he called it, non anticipating. That is, he only allows integrands that are adapted
to the underlying filtration of σ-algebras generated by the Brownian motion. This
allowed him to make use of the independence of the increments of Brownian motion
to establish the isometry

$$E\left[\left(\int_0^t H_s\,dW_s\right)^2\right] = E\left[\int_0^t H_s^2\,ds\right].$$

Once the isometry is established for continuous non-anticipating processes H, it
then extends to jointly measurable non-anticipating processes.¹¹
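
The isometry for Brownian motion, E[(∫ H dW)²] = E[∫ H² ds], can be illustrated by Monte Carlo for the non-anticipating integrand H = W itself, for which the right-hand side equals T²/2. The Python sketch below is an assumption-laden illustration, not a proof: the integral is approximated by left-endpoint sums, and the seed, number of steps and number of paths are arbitrary.

```python
import math
import random


def ito_integral_of_w(n_steps, horizon, rng):
    """One sample of the left-endpoint sum approximating the integral of W dW on [0, horizon]."""
    dt = horizon / n_steps
    w, total = 0.0, 0.0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))
        total += w * dw   # integrand evaluated before the increment: non-anticipating
        w += dw
    return total


rng = random.Random(2)  # arbitrary seed
T, n_steps, n_paths = 1.0, 100, 5_000
samples = [ito_integral_of_w(n_steps, T, rng) for _ in range(n_paths)]

lhs = sum(x * x for x in samples) / n_paths   # estimate of E[(integral of W dW)^2]
rhs = T * T / 2.0                             # E[integral of W_s^2 ds] = T^2 / 2
mean = sum(samples) / n_paths                 # the integral has mean zero
```

Evaluating the integrand at the left endpoint is what makes each summand a product of independent factors; that independence is exactly the property of Brownian increments that the isometry exploits.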

J. L. Doob realized that Ito’s construction of his stochastic integral for
Brownian motion did not use the full strength of the independence of the increments of
Brownian motion. In his highly influential 1953 book [16] he extended Ito’s stochastic
integral for Brownian motion first to processes with orthogonal increments (in
the $L^2$ sense), and then to processes with conditionally orthogonal increments, that
is, martingales. What he needed, however, was a martingale M such that $M^2(t) - F(t)$
is again a martingale, where the increasing process F is non-random. He established
the now famous Doob decomposition theorem for submartingales: If $X_n$ is a
(discrete time) submartingale, then there exists a unique decomposition $X_n = M_n + A_n$
where M is a martingale, and A is a process with non-decreasing paths, $A_0 = 0$,
and with the special measurability property that $A_n$ is $\mathcal{F}_{n-1}$ measurable. Since $M^2$
is a submartingale when M is a martingale, he needed an analogous decomposition
theorem in continuous time in order to extend further his stochastic integral. As it
was, however, he extended Ito’s isometry relation as follows:

$$E\left[\left(\int_0^t H_s\,dM_s\right)^2\right] = E\left[\int_0^t H_s^2\,dF(s)\right],$$

where F is non-decreasing and non-random, $M^2 - F$ is again a martingale, and also
the stochastic integral is a martingale. (See Chapter IX of [16].)

circles to this day. Obviously this key theorem of Ito is much more important than the status the
lowly nomenclature “lemma” affords it, and we prefer Ito’s own description: “formula”.

¹¹ Indeed, this is how the theory is presented in the little 1969 book of McKean [47].
Unfortunately it is not as simple as McKean thought at this early stage of the theory, to determine exactly
which processes are included in this procedure; the natural σ-algebra generated by the simple
integrands is today known as the predictable σ-algebra, and the predictably measurable processes
are a strict subset of jointly measurable, non-anticipating processes. This point is clarified in (for
example) the book of K. L. Chung and R. Williams [9], p. 63.

Thus it became an interesting question, if only for the purpose of extending
the stochastic integral to martingales in general, to see if one could extend Doob’s
decomposition theorem to submartingales indexed by continuous time. However
there were other reasons as well, such as the development of probabilistic potential
theory, which began to parallel the development of axiomatic potential theory,
especially with the publication of G. A. Hunt’s seminal papers in 1957 and 1958
[31, 32, 33]. It took perhaps a decade for these papers to be fully appreciated, but in
the late 1960’s and early 1970’s they led to even greater interest in Ito’s treatment
of Markov processes as solutions of stochastic differential equations, involving both
Brownian motion and what is today known as Poisson random measure.

The issue was resolved in two papers by the (then) young French mathematician
P. A. Meyer in 1962. Indeed, as if to underline the importance of probabilistic
potential theory in the development of the stochastic integral, Meyer’s first paper,
establishing the existence of the Doob decomposition for continuous time
submartingales [52], is written in the language of potential theory. Meyer showed that
the theorem is false in general, but true if and only if one assumes that the
submartingale has a uniform integrability property when indexed by stopping times,
which he called “Class (D)”, clearly in honor of Doob. Ornstein had shown that
there were submartingales not satisfying the Class (D) property¹², and G. Johnson
and L. L. Helms [40] quickly provided an example in print in 1963, using three
dimensional Brownian motion. Also in 1963, P. A. Meyer established the uniqueness
of the Doob decomposition [53], which today is known as the Doob-Meyer
decomposition theorem. In addition, in this second paper Meyer provides an analysis of
the structure of $L^2$ martingales, which later will prove essential to the full
development of the theory of stochastic integration. Two years later, in 1965, Ito and
S. Watanabe, while studying multiplicative functionals of Markov processes, define
local martingales [39]. This turns out to be the key object needed for Doob’s original
conjecture to hold. That is, any submartingale X, whether it is of Class (D) or not,
has a unique decomposition

$$X_t = M_t + A_t,$$

where M is a local martingale, and A is a non-decreasing, predictable process with
$A_0 = 0$.

Returning however to P. A. Meyer’s original paper [52], at the end of the paper, 
as an application of his decomposition theorem, he proposes an extension of Doob’s 
stochastic integral, and thus a fortiori an extension of Ito’s integral. His space of 
integrands is that of “well adapted” processes, meaning jointly measurable and
adapted to the underlying filtration of σ-algebras. He makes the prescient remark
at the end of his paper that “it seems hard to show (though it is certainly true) that
the full class of well adapted processes whose “norm” is finite has been attained
by this procedure.” This anticipates the oversight of McKean six years later (see
footnote 11), and it is this somewhat esoteric measurability issue that delays the
full development of stochastic integration for martingales which have jumps, as we 
shall see. 

Before we continue our discussion of the evolution of the theory of stochastic
integration, however, let us digress to discuss the developments in economics. It is
curious that Peter Bernstein, in his 1992 book [4], states “Despite its importance,
Bachelier’s thesis was lost until it was rediscovered quite by accident in the 1950’s by
Jimmie Savage, a mathematical statistician at Chicago.” He goes on a little later to
say “Some time around 1954, while rummaging through a university library, Savage

¹² See, for example, [59], p. 823.

chanced upon a small book by Bachelier, published in 1914, on speculation and
investment.” We know however that Kolmogorov and also Doob explicitly reference
Bachelier, and Ito certainly knew of his work too; but perhaps what was “lost” was
Bachelier’s contributions to economics.¹³ Bernstein relates that Savage alerted the
economist Paul Samuelson to Bachelier’s work, who found Bachelier’s thesis in the
MIT library, and later remarked “Bachelier seems to have had something of a
one-track mind. But what a track!” [73]. See also [74].

After a decade of lectures around the country on warrant pricing and how stock
prices must be random,¹⁴ Samuelson then went on to publish, in 1965, two papers
of ground breaking work. In his paper [72] he gives his economics arguments that
prices must fluctuate randomly, 65 years after Bachelier had assumed it! This paper,
along with Fama’s [24] work on the same topic, forms the basis of what has come
to be known as “the efficient market hypothesis.” The efficient market hypothesis
caused a revolution in empirical finance; the debate and empirical investigation of
this hypothesis is still continuing today (see [25]). Two other profound insights can
be found in this early paper that subsequently, but only in a modified form, became
the mainstay of option pricing theory. The first idea is the belief (postulate) that
discounted futures prices follow a martingale¹⁵. From this postulate, Samuelson
proved that changes in futures prices were uncorrelated across time, a generalization
of the random walk model (see [46], and also [13]). The second insight is that this
proposition can be extended to arbitrary functions of the spot price, and although
he did not state it explicitly herein, this forebodes an immediate application to
options.

In his companion paper [71], he combined forces with H. P. McKean Jr.¹⁶ (who
the same year published his tome together with K. Itô [38]), who wrote a
mathematical appendix to the paper, to show essentially that a good model for stock
price movements is what is today known as geometric Brownian motion. Samuelson
explains that Bachelier's model failed to ensure that stock prices always be
positive, and that his model leads to absurd inconsistencies with economic principles,
whereas geometric Brownian motion avoids these pitfalls. This paper also derived
valuation formulas for both European and American options.¹⁷ The derivation was
almost identical to that used nearly a decade later to derive the Black-Scholes
formula, except that instead of invoking a no arbitrage principle to derive the valuation
formula, he again postulated the condition that the discounted options payoffs
follow a martingale (see [71] p. 19), from which the valuation formulae easily followed.
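
The contrast between Bachelier's model and geometric Brownian motion can be illustrated with a minimal sketch. The function names, the parameter values and the hand-picked Brownian increments below are illustrative assumptions, not taken from [71]; the point is only that the additive model can cross zero while the exponential model cannot.

```python
import math

def arithmetic_path(s0, sigma, increments):
    """Bachelier-style price S_t = s0 + sigma * W_t, which can become negative."""
    path, w = [s0], 0.0
    for dw in increments:
        w += dw
        path.append(s0 + sigma * w)
    return path

def geometric_path(s0, mu, sigma, dt, increments):
    """Geometric Brownian motion S_t = s0 * exp((mu - sigma^2 / 2) t + sigma W_t)."""
    path, w, t = [s0], 0.0, 0.0
    for dw in increments:
        w += dw
        t += dt
        path.append(s0 * math.exp((mu - 0.5 * sigma ** 2) * t + sigma * w))
    return path

# Hypothetical run of unfavourable Brownian increments over steps of length 0.25.
increments = [-0.8, -0.9, 0.3, -1.1]
bachelier = arithmetic_path(1.0, 1.0, increments)
gbm = geometric_path(1.0, 0.05, 1.0, 0.25, increments)

print("minimum of the arithmetic path:", min(bachelier))
print("geometric path stays positive:", all(s > 0 for s in gbm))
```

Both paths are driven by the same increments, so the difference in sign behaviour comes from the model and not from the noise.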

¹³ It is possible that L. J. Savage read Bachelier's work because Doob's book had appeared only
one year earlier and had referenced it, and then he might have been surprised by the economics
content of Bachelier's work. But this is pure speculation. Also, Samuelson wrote in [73] (p. 6) that
"this was largely lost in the literature, even though Bachelier does receive occasional citation in
standard works in probability."

¹⁴ These lectures led to other papers being published by researchers following up on Samuelson's
ideas, for example the renowned paper of Osborne [62].

¹⁵ See the Theorem of Mean Percentage Price Drift on page 46 and the subsequent discussion.

¹⁶ Samuelson combined forces with McKean, and later R. C. Merton, because he did not feel
comfortable with the newly developed stochastic calculus (see [4] p. 215). This insight was also
confirmed by private communications with R. C. Merton.

¹⁷ This is the paper that first coined the terms "European" and "American" options. According
to a private communication with R. C. Merton, prior to writing the paper, P. Samuelson went to
Wall Street to discuss options with industry professionals. His Wall Street contact explained that
there were two types of options available, one more complex - that could be exercised any time
prior to maturity, and one more simple - that could be exercised only at the maturity date, and that
only the more sophisticated European mind (as opposed to the American mind) could understand
the former. In response, when Samuelson wrote the paper, he used these as prefixes and reversed
the ordering.
The much later insights of Black, Scholes, and Merton, relating prices of options to
perfect hedging strategies, are of course not discussed in this article. Furthermore,
it is also noteworthy that within this paper, Samuelson and McKean determine
the price of an American option by discovering the relation of an American option
to a free boundary problem for the heat equation. This is the first time that this
connection is made. Interestingly, Samuelson and McKean do not avail themselves
of the tools of stochastic calculus, at least not explicitly. The techniques McKean
uses in his appendix are partial differential equations in the spirit of Kolmogorov,
coupled with stopping times and the potential theoretic techniques pioneered by
G. Hunt and developed by Dynkin.

The final precursor to the Black, Scholes and Merton option pricing formulae
can be found in the paper of Samuelson and Merton [75]. Following similar
mathematics to [71], instead of invoking the postulate that discounted option payoffs
follow a martingale, they derived this postulate as an implication of a utility
maximizing investor's optimization decision. Herein, they showed that the option's price
could be viewed as its discounted expected value, where instead of using the actual
probabilities to compute the expectation, one uses utility or risk adjusted
probabilities.¹⁸ These risk adjusted probabilities later became known as "risk-neutral"
or "equivalent martingale" probabilities. It is interesting to note that, contrary
to common belief, this use of "equivalent martingale probabilities" under another
guise predated the paper by Cox and Ross [12] by nearly 10 years. In fact, Merton
(footnote 5 page 218, [50]) points out that Samuelson knew this fact as early
as 1953! Again, by not invoking the no arbitrage principle, this paper just missed
obtaining the famous Black-Scholes formula. The first use of the no arbitrage
principle to prove a pricing relation between various financial securities can be found
in Modigliani and Miller [60] some eleven years earlier, where they showed the
equivalence between two different firms' debt and equity prices, generating the
famous M&M Theorem. Both Samuelson and Merton were aware of this principle,
Modigliani being a colleague at M.I.T., but neither thought to apply it to this
pricing problem until many years later.
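
The idea that an option price is a discounted expected value under adjusted probabilities can be checked numerically in the later Black-Scholes setting. The sketch below is an illustration under stated assumptions and not the utility-based construction of [75]: it assumes geometric Brownian motion whose drift under the adjusted probabilities equals the interest rate r, and all function names and parameter values are hypothetical. It compares the closed-form Black-Scholes call price with a trapezoid-rule evaluation of the discounted expected payoff.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(s0, k, r, sigma, t):
    """Closed-form Black-Scholes price of a European call."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s0 * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)

def discounted_expected_payoff(s0, k, r, sigma, t, n=20000, zmax=8.0):
    """Trapezoid-rule value of exp(-r t) * E[max(S_t - k, 0)] when the drift is r."""
    h = 2.0 * zmax / n
    total = 0.0
    for i in range(n + 1):
        z = -zmax + i * h  # value of the standard normal variable driving S_t
        s_t = s0 * math.exp((r - 0.5 * sigma ** 2) * t + sigma * math.sqrt(t) * z)
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * max(s_t - k, 0.0) * density
    return math.exp(-r * t) * total * h

closed_form = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
expectation = discounted_expected_payoff(100.0, 100.0, 0.05, 0.2, 1.0)
print(closed_form, expectation)
```

The two numbers agree to several decimal places, which is the sense in which the formula is a discounted expectation once the probabilities have been adjusted.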

Unrelated to finance, and almost as an aside in the general tide of the
development of the theory of stochastic integration, were the insights of Herman
Rubin. At the Third Berkeley Symposium in 1955, Rubin gave a talk on stochastic
differential equations. The following year, he presented an invited paper at the
Seattle joint meetings of the Institute of Mathematical Statistics, the American
Mathematical Society, the Biometric Society, the Mathematical Association of
America, and the Econometrics Society. In this paper he outlined what was later
to become D. L. Fisk's Ph.D. thesis, which invented both quasimartingales and
what is now known as the Stratonovich integral. To quote his own recollections,
"I was unhappy with the Itô integral because of the lack of invariance with
non-linear change of coordinate systems, no matter how smooth, and, observing that
using the average of the right and left endpoints gave exactly the right results for
the integral of XdX for any X (even discontinuous), it seemed that this was, for
continuous X with sufficiently good properties, the appropriate candidate for the
integral... Quasimartingales seemed the natural candidate for the class of processes,
but I did not see a clear proof. I gave the problem to Fisk to work on for a Ph.D.
thesis, and he did come up with what was needed" [69].
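
Rubin's observation about the integral of XdX can be verified on any sampled path. In the minimal sketch below (the function names and the simulated path with one large jump are illustrative assumptions), the sum using the average of the left and right endpoints telescopes exactly to (X_T^2 - X_0^2)/2, whereas the left-endpoint (Itô-type) sum falls short of it by half the sum of squared increments.

```python
import random

def left_endpoint_sum(x):
    """Ito-type sum: integrand evaluated at the left endpoint of each step."""
    return sum(x[i] * (x[i + 1] - x[i]) for i in range(len(x) - 1))

def averaged_endpoint_sum(x):
    """Sum using the average of the left and right endpoints of each step."""
    return sum(0.5 * (x[i] + x[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))

def sum_of_squared_increments(x):
    """Discrete quadratic variation of the sampled path."""
    return sum((x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1))

# Hypothetical path: Gaussian steps plus one large jump, so X is discontinuous.
random.seed(0)
path = [0.0]
for step in range(1000):
    jump = 5.0 if step == 500 else 0.0
    path.append(path[-1] + random.gauss(0.0, 0.1) + jump)

target = 0.5 * (path[-1] ** 2 - path[0] ** 2)
print(averaged_endpoint_sum(path) - target)  # zero up to rounding
print(left_endpoint_sum(path) - (target - 0.5 * sum_of_squared_increments(path)))
```

The identity for the averaged sum is purely algebraic, which is why it holds for every path, with or without jumps.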

Indeed, in D. L. Fisk's thesis [27], written under Rubin when he was at Michigan
State University, Fisk developed what is now known as the Stratonovich integral,

¹⁸ See especially expression (20) on page 26.
and he also coined the phrase and developed the modern theory of quasimartingales,
later used by K. M. Rao [68] to give an elegant proof that a quasimartingale is the
difference of two submartingales, and also used by S. Orey [63] in a paper extending
the idea and which foreshadowed modern day semimartingales. Fisk submitted his
thesis for publication, but the editor did not believe there was much interest in
stochastic integration, again according to the recollections of Herman Rubin [69].
So Fisk dropped that part of the thesis and did not pursue it, publishing instead
only the part on quasimartingales, which appeared as [28].

Returning now to the historical development of stochastic integration, we
mention that P. A. Meyer's development of the stochastic integral in [52] is skeletal at
best, and a more systematic development is next put forward by Philippe Courrège
in 1963 [10]. The motivation clearly arises from potential theory, and the paper of
Courrège is published not in a journal, but in the (at the time) widely circulated
Séminaire Brelot-Choquet-Deny (Théorie du Potentiel). Many reasonable Markov
processes, and in particular those treated by Hunt ([31, 32, 33]), have the
property that they are quasi-left continuous. That is, they have paths which are right
continuous with left limits a.s., and if there is a jump at a stopping time T, then
that time T must be totally inaccessible. Intuitively, T must come as a complete
surprise. One can formulate the condition of quasi-left continuity in terms of the
underlying filtration of σ-algebras of the Markov process as well. This seems to be
a reasonable property for the filtration of a time homogeneous Markov process to
have, and is satisfied for a vast collection of examples.

It was natural for someone working in potential theory to make the assumption
that the filtration is quasi-left continuous, and such an assumption has the fortuitous
consequence to imply that if X is a submartingale and X = M + A is its
Doob-Meyer decomposition, then A has continuous sample paths. What this means is
that in the isometry

\[
E\left[\left(\int_0^t H_s\,dM_s\right)^{2}\right] = E\left[\int_0^t H_s^{2}\,dA_s\right],
\]

where A is the increasing process corresponding to the submartingale X = M²,
one extends the Itô-Doob technique to general martingales, and the resultant
increasing random process A has continuous paths. This, it turns out, greatly
simplifies the theory. And it is precisely this assumption that Courrège makes. Courrège
also works with integrands which have left continuous paths, and he considers the
space of processes that are measurable with respect to the σ-algebra they generate,
on R₊ × Ω, calling it processes which are "fortement bien adaptés". Thus Courrège
had, in effect, pioneered the predictable σ-algebra, although he did not use it as P.
A. Meyer did, as we shall see. As it turns out, if dA_t is path by path absolutely
continuous with respect to dt (this is usually written dA_t ≪ dt), almost surely,
then there ends up being essentially no difference which σ-algebra one uses: the
predictable σ-algebra, or the progressive σ-algebra,¹⁹ or even jointly measurable
adapted processes. However if A is merely continuous and does not necessarily have
absolutely continuous paths a.s., then one needs at least the progressive σ-algebra.
We now know that what happens is that the difference between one such process
and its predictable projection is a process that has a stochastic integral which is

¹⁹ The progressive σ-algebra is defined later in the theory, and it has the property that if a
process H is progressively measurable, and if τ is a finite valued stopping time, then H_τ is F_τ
measurable.
the zero process a.s., and this is why it does not matter. (For a detailed explanation
see Liptser and Shiryaev [45], or alternatively Chung and Williams [9].)

One important thing that Courrège did not do, however, was to prove a change of
variables formula, analogous to Itô's formula for stochastic integration with respect
to Brownian motion. This was done in 1967 in an influential paper of H. Kunita and
S. Watanabe [42]. Whereas the approach of Courrège was solidly in the tradition
of Doob and Itô, that of establishing an isometry, the approach pioneered by
M. Motoo and S. Watanabe two years later in 1965 was new: they treated the
stochastic integral as an operator on martingales having specific properties, utilizing
the Hilbert space structure of L² by using the Doob-Meyer increasing process to
inspire an inner product through the quadratic variation of martingales. (See [61].)
In the same paper Motoo and Watanabe established a martingale representation
theorem which proved to be prescient of what was to come: they showed that all
L² martingales defined on a probability space obtained via the construction of a type
of Markov process named a Hunt process (in honor of the fundamental papers of
G. Hunt mentioned earlier) were generated by a collection of additive functionals
which were also martingales, and which were obtained in a way now associated
with Dynkin's formula and "martingale problems."

The important paper of Motoo and Watanabe, however, was quickly
overshadowed by the subsequent and beautifully written paper of H. Kunita and S.
Watanabe, published in 1967 [42]. Here Kunita and Watanabe developed the ideas on
orthogonality of martingales pioneered by P. A. Meyer, and Motoo and Watanabe,
and they developed a theory of stable spaces of martingales which has proved
fundamental to the theory of martingale representation, known in Finance as "market
completeness." They also clarified the idea of quadratic variation as a pseudo inner
product, and used it to prove a general change of variables formula, profoundly
extending Itô's formula for Brownian martingales. The formula was clean and simple
for martingales with continuous paths, but when it came to the general case (i.e.,
martingales that can have jump discontinuities in their sample paths) the authors
retreated to the rich structure available to them in the Hunt process setting, and
they expressed the jumps in terms of the Lévy system of the underlying Markov
process. (Lévy systems for Markov processes, a structure which describes the jump
behavior of a Hunt process, had only been developed a few years earlier in 1964 by
S. Watanabe [85], and extended much later by A. Benveniste and J. Jacod [3].) This
"retreat" must have seemed natural at the time, since stochastic integrals were,
as noted previously, seen as intimately intertwined with Markov processes. And
also, as an application of their change of variables formula, Kunita and Watanabe
gave simple and elegant proofs of Lévy's theorem characterizing Brownian motion
among continuous martingales via its quadratic variation process, as well as an
extension from one to N dimensions of the spectacular 1965 theorem of L. Dubins and
G. Schwarz [18] and K. E. Dambis [14] that a large class of continuous martingales
can be represented as time changes of Brownian motion.

This remarkable paper of Kunita and Watanabe was quickly appreciated by
P. A. Meyer, now in Strasbourg. He helped to start, with the aid of
Springer-Verlag, the Séminaire de Probabilités, which is one of the longest running seminars
to be published in Springer's famed Lecture Notes in Mathematics series. In the
first issue, which is only Volume 39 in the Lecture Notes series, he published four
key papers inspired by the article of Kunita and Watanabe [54, 55, 56, 57].²⁰ In

²⁰ A large number of the historically important works on stochastic integration were published
in the Séminaire de Probabilités series, and these papers have been recently reprinted in a new
these papers he made two important innovations: he went beyond the "inner
product" of Kunita and Watanabe (which is and was denoted ⟨X, Y⟩, and which
is tied to the Doob-Meyer decomposition), and expanding on an idea of Austin
for discrete parameter martingales he created the "square bracket" (le crochet
droit) pseudo inner product, denoted [X, Y]. Unlike the bracket process ⟨X, Y⟩,
which exists for all locally square integrable martingales (and therefore all
continuous ones), the square bracket process exists for all martingales, and even all
local martingales. This turned out to be important in later developments, such
as the invention of semimartingales, and of course is key to the extension of the
stochastic integral to all local martingales, and not only locally square integrable
ones.
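
In discrete time the square bracket has an elementary analogue, the sum of products of increments, and its role as a pseudo inner product can be seen directly. The sketch below is an illustration only (the helper names and the two short sequences are hypothetical): it checks the polarization identity and the discrete integration by parts formula X_n Y_n - X_0 Y_0 = Σ X_{i} ΔY + Σ Y_{i} ΔX + [X, Y], in which the bracket appears as the correction term.

```python
def increments(x):
    """Successive differences of a sampled path."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]

def square_bracket(x, y):
    """Discrete analogue of [X, Y]: sum of products of increments."""
    return sum(dx * dy for dx, dy in zip(increments(x), increments(y)))

def discrete_integral(h, x):
    """Sum of h at the left endpoint of each step times the increment of x."""
    return sum(h[i] * dx for i, dx in enumerate(increments(x)))

# Two hypothetical sampled paths.
x = [0.0, 1.0, -0.5, 2.0, 1.5]
y = [1.0, 0.5, 0.5, -1.0, 3.0]

# Integration by parts with the bracket as correction term.
lhs = x[-1] * y[-1] - x[0] * y[0]
rhs = discrete_integral(x, y) + discrete_integral(y, x) + square_bracket(x, y)

# Polarization: [X, Y] = ([X + Y, X + Y] - [X - Y, X - Y]) / 4.
s = [a + b for a, b in zip(x, y)]
d = [a - b for a, b in zip(x, y)]
polarized = 0.25 * (square_bracket(s, s) - square_bracket(d, d))

print(lhs, rhs)
print(square_bracket(x, y), polarized)
```

Nothing in this computation requires integrability of the paths, which is one informal way to see why the square bracket is available in greater generality than the bracket tied to the Doob-Meyer decomposition.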

The second major insight of Meyer in these papers is his realization of the
importance of the predictable σ-algebra. Going far beyond Courrège he realized that
when a martingale also had paths of finite variation (of necessity a martingale with
jumps), the stochastic integral should agree with a path by path construction using
Lebesgue-Stieltjes integration. He showed that this holds if and only if the integrand
is a predictable process. Moreover, he was able to analyze the jumps of the stochastic
integral, observing that the stochastic integral has the same jump behavior as does
the Lebesgue-Stieltjes integral if the integrand is predictably measurable. This laid
the groundwork for the semimartingale theory that was to come a few years later.

We should further note at this point that Meyer was able to discard the Markov
process framework used by Kunita and Watanabe in the first two of the four papers,
and he established the general change of variables formula used today without
using Lévy systems. Meyer then applied his more general results to Markov processes
in the latter two of his four papers. Again, this was natural, since one of Meyer's
primary interests was to resolve the many open questions raised by Hunt's seminal
papers. It was research in Markov processes that was driving the interest in
stochastic integration, from Itô on, up to this point. Nevertheless, Doob had begun
to isolate the martingale character of processes independent of Markov processes,
and Meyer's approach in his classic papers of 1962 and 1963 (already discussed, [52]
and [53]) was to use the techniques developed in Markov process potential theory
to prove purely martingale theory results.

The development of stochastic integration as recounted so far seems to be
primarily centered in Japan and France. But important parallel developments were
occurring in the Soviet Union. The books of Dynkin on Markov processes appeared
early, in 1960 [19] and in English as Springer Verlag books in 1965 [20]. The famed
Moscow seminar (reconstituted at least once on October 18 and 19, 1996 in East
Lansing, Michigan, with Dynkin, Skorohod, Wentzell, Freidlin, Krylov, etc.), and
Girsanov's work on transformations of Brownian motion date to 1960 and
earlier [29].²¹ Stratonovich developed a version of the Itô integral which obeys the
usual Riemann-Stieltjes change of variables formula, but sacrifices the martingale
property as well as much of the generality of the Itô integral [80].²² While popular

volume of the Séminaire, with a small amount of commentary as well [23].

²¹ Girsanov's work extends the much earlier work first of Cameron and Martin [8], who in 1949
transformed Brownian paths for both deterministic translations and also some random
translations, keeping the old and new distributions of the processes equivalent (in the sense of
having the same sets of probability zero); these ideas were extended to Markov processes first by
Maruyama [48] in 1954, and then by Girsanov in 1960. It was not until 1974 that Van Schuppen
and Wong [82] extended these ideas to martingales, followed in 1976 by P. A. Meyer [58] and in
1977 Lenglart [43] for the current modern versions. See also (for example) pages 132-136 of [67]
for an exposition of the modern results.

²² Indeed, the Stratonovich integral was not met with much excitement. In a book review of the
in some engineering circles, the Stratonovich integral seemed to be primarily a
curiosity, until much later when it was shown that if one approximates the paths of
Brownian motion with differentiable curves, the resultant integrals converge to the
Stratonovich integral; this led to it being an intrinsic object in stochastic differential
geometry (see, e.g., [22]).

The primary works of interest in the Soviet Union were the series of articles of
Skorohod. Again mainly inspired by the developing theory of Markov processes,
Skorohod generalized the Itô integral in ways startlingly parallel to those of Courrège
and Kunita and Watanabe. In 1963 Skorohod, squarely in the framework of Markov
processes and clearly inspired by the work of Dynkin, developed a stochastic
integral for martingales which is analogous to what Courrège had done in France,
although he used changes of time [76]. In 1966, while studying additive functionals
of continuous Markov processes, he developed the idea of quadratic variation of
martingales, as well as what is now known as the Kunita-Watanabe inequality, and
the same change of variables formula that Kunita and Watanabe established [77].
He extended his results and his change of variables formula to martingales with
jumps (always only those defined on Markov processes) in 1967 [79]. The jump
terms in the change of variables formula are expressed with the aid of a kernel
reminiscent of the Lévy systems of S. Watanabe.²³

We close this short history with a return to France. After the paper of Kunita
and Watanabe, and after P. A. Meyer's four papers extending their results, there
was a hiatus of three years before the paper of C. Doléans-Dade and P. A. Meyer
appeared [15]. Prior to this paper the development of stochastic integration had
been tied rather intimately to Markov processes, and was perhaps seen as a tool
with which one could more effectively address certain topics in Markov process
theory. A key assumption made by the prior work of H. Kunita and S. Watanabe,
and also of P. A. Meyer, was that the underlying filtration of σ-algebras was quasi
left continuous, alternatively stated as saying that the filtration had no fixed times
of discontinuity. Doléans-Dade and Meyer were able to remove this hypothesis,
thus making the theory a purely martingale theory, and casting aside its relation
to Markov processes. This can now be seen as a key step that led to the explosive
growth of the theory in the 1970's and also in finance to the fundamental papers of
Harrison-Kreps and Harrison-Pliska, towards the end of the next decade. Last, in
this same paper Doléans-Dade and Meyer coined the modern term semimartingale,
to signify the most general process for which one knew (at that time) there existed
a stochastic integral.²⁴

time Skorohod wrote "The proposed integral, when it exists, may be expressed rather simply using
the Itô integral. However the class of functions for which this integral exists is extremely narrow and
artificial. Although some of the formulas are made more simple by using the symmetrized integral
(while most of them are made more complicated which will be made clear from what follows),
its use is extremely restricted by its domain of definition. Thus this innovation is completely
unjustified." [78] The Stratonovich integral was developed simultaneously by D. Fisk in the United
States, as part of his PhD thesis. However it was rejected for publication as being too trivial. In
the second half of his thesis he invents quasimartingales, and that half was indeed published [28].

²³ P. A. Meyer's work ([54, 55, 56, 57]), which proved to be highly influential in the West,
references Courrège, Motoo and Watanabe, Watanabe, and Kunita and Watanabe, but not Skorohod,
of whose work Meyer was doubtless unaware. Unfortunately this effectively left Skorohod's work
relatively unknown in the West for quite some time.

²⁴ As we will see in a sequel to this paper, the description of semimartingales of Doléans-Dade
and Meyer of 1970 turned out to be prescient. In the late 1970's C. Dellacherie and K. Bichteler
simultaneously proved a characterization of semimartingales: they showed that given a right
continuous process X with left limits, if one defined a stochastic integral in the obvious way on simple
predictable processes, and if one insisted on having an extremely weak version of a bounded
convergence theorem, then X was a fortiori a semimartingale.
martingale noise: Kalman filter with
fBm noise 

L. Gawarecki and V. Mandrekar
Abstract: We consider a non-linear filtering problem with Gaussian martingales
as a noise process, and obtain iterative equations for the optimal filter. We
apply that result in the case of a fractional Brownian motion noise process and
derive Kalman type equations in the linear case.


1. Introduction 

The study of filtering of a stochastic process with a general Gaussian noise was
initiated in [8]. In case the system satisfies a stochastic differential equation, we
derived an iterative form for the optimal filter given by the Zakai equation ([3]).
It was shown in [2] that in the case of a Gaussian noise, one can derive the FKK
equation from which one can obtain the Kalman filtering equation. However in order
to obtain Kalman's equation in the case of fractional Brownian motion (fBm) noise,
we had to assume in [3] the form of the observation process, which was not intuitive.
Using the ideas in [5], we are able to study the problem with a natural form of the
observation process as in the classical work. In order to get such a result from the
general theory we have to study the Bayes formula for Gaussian martingale noise
and use the work in [5]. This is accomplished in Section 2. In Section 3, we obtain
iterative equations for the optimal filter and in Section 4 we apply them to the case
of fBm noise.

The problem of filtering with system and observation processes driven by fBm
was considered in [1]. However, even the form of the Bayes formula in this case is
complicated and no iterative equations for the filter can be obtained. The Bayes
formula in [8] is applicable to any system process and observation process with
Gaussian noise. In order to get iterative equations in the non-linear case we assume
that the system process is a solution of a martingale problem. This allows us to
obtain an analogue of the Zakai and FKK equations. As a consequence, we easily
derive the Kalman equations in the linear case. If the data about the "signal" are
sent to the server and transmitted to AWACS, the resulting process has bursts [6].
We assume a particular form for this observation process (see equation (3)). In most
cases, the signal (a missile trajectory, e.g.) is Markovian.

The work completed by D. Fisk under the guidance of Professor Herman Rubin
has found applications in deriving filtering equations in the classical case [4].
2. Bayes formula with Gaussian martingale noise

Let us consider the filtering problem with a signal or system process $\{X_t,\ 0 \le t \le T\}$,
which is unobservable. Information about $X_t$ is obtained by observing
another process $Y_t$, which is a function of $X_t$, and is corrupted by noise, i.e.

\[
Y_t = \beta(t, X) + N_t, \quad 0 \le t \le T,
\]

where $\beta(t, \cdot)$ is measurable with respect to the σ-field $\mathcal{F}_t^X$, generated by the signal
process $\{X_s,\ 0 \le s \le t\}$, and the noise $\{N_t,\ 0 \le t \le T\}$ is independent of
$\{X_t,\ 0 \le t \le T\}$. The observation σ-field $\mathcal{F}_t^Y = \sigma\{Y_s,\ 0 \le s \le t\}$ contains all the
available information about the signal $X_t$. The primary aim of filtering theory is to
get an estimate for $X_t$ based on the σ-field $\mathcal{F}_t^Y$. This is given by the conditional
distribution $\Pi_t$ of $X_t$ given $\mathcal{F}_t^Y$ or, equivalently, by the conditional expectation
$E\left(f(X_t) \mid \mathcal{F}_t^Y\right)$ for a rich enough class of functions $f$. Since this estimate minimizes
the squared error loss, $\Pi_t$ is called the optimal filter.

In [8] an expression for an optimal filter was given for $\{N_t,\ 0 \le t \le T\}$, a
Gaussian process and $\beta(\cdot, X) \in H(R)$, the reproducing kernel Hilbert space (RKHS)
of the covariance $R$ of the process $N_t$ ([8]). Throughout we assume, without loss of
generality, that $E(N_t) = 0$.

Let us assume that $N_t = M_t$, a continuous Gaussian martingale with the
covariance function $R_M$. We shall first compute the form of $H(R_M)$. As we shall be
using this notation exclusively for the martingale $M_t$, we will drop the subscript
$M$ from now on and denote the RKHS of $R$ by $H(R)$. Let us also denote by $m(t)$
the expectation $E M_t^2$. Note that $m(t)$ is a non decreasing function on $[0,T]$ and,
abusing the notation, we will denote by $m$ the associated measure on the Borel
subsets $\mathcal{B}([0,T])$. With this convention, we can write

\[
H(R) = \left\{ g : g(t) = \int_0^t g^*(u)\,dm(u),\ 0 \le t \le T,\ g^* \in L^2(m) \right\}.
\]

The scalar product in $H(R)$ is given by $(g_1, g_2)_{H(R)} = \langle g_1^*, g_2^* \rangle_{L^2(m)}$. If we denote
by $H(R : t)$ the RKHS of $R|_{[0,t] \times [0,t]}$, it follows from the above that

\[
H(R : t) = \left\{ g : g(s) = \int_0^s g^*(u)\,dm(u),\ 0 \le s \le t,\ g^* \in L^2(m) \right\}.
\]

It is well known (see [8], Section 2), that there exists an isometry $\pi$ between $H(R)$
and $\overline{\mathrm{sp}}^{L^2}\{M_t,\ 0 \le t \le T\}$, which, in case $M$ is a martingale, is given by

\[
\pi(g) = \int_0^T g^*(u)\,dM_u,
\]

where the RHS denotes the stochastic integral of the deterministic function $g^*$ with
respect to $M$. The isometry

\[
\pi_t : H(R : t) \to \overline{\mathrm{sp}}^{L^2}\{M_s,\ 0 \le s \le t\}
\]

is given by $\pi_t(g) = \int_0^t g^*(u)\,dM_u$.

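Suppose now

\[
Y_t = \int_0^t h(s, X)\,dm(s) + M_t,
\]

where $h(s, X)$ is $\mathcal{F}_s^X$-measurable and $h(\cdot, X) \in L^2(m)$. Then using Theorem 3.2
of [8] we get the Bayes formula for an $\mathcal{F}_T^X$-measurable and integrable function
$g(T, X)$:

\[
E\left(g(T,X) \mid \mathcal{F}_t^Y\right)
= \frac{\int g(T,x)\, \exp\left\{\int_0^t h(s,x)\,dY_s - \frac{1}{2}\int_0^t h^2(s,x)\,dm(s)\right\} dP \circ X^{-1}}
       {\int \exp\left\{\int_0^t h(s,x)\,dY_s - \frac{1}{2}\int_0^t h^2(s,x)\,dm(s)\right\} dP \circ X^{-1}}.
\tag{1}
\]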
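
A minimal sketch may help to see how the weights in the Bayes formula (1) act. It assumes the simplest possible setting, which is an illustration and not the generality of [8]: $m(t) = t$, a signal that is one of finitely many constant levels $h$, and observations given as increments of $Y$ over steps of length `dt`. All names and numbers are hypothetical. The weight $\exp\{\int h\,dY - \frac{1}{2}\int h^2\,dm\}$ is compared with a direct application of Bayes' rule to independent Gaussian increments.

```python
import math

def unnormalized_weight(h, dy, dt):
    """exp(sum h dY - 1/2 sum h^2 dm) for a constant level h, with m(t) = t."""
    return math.exp(h * sum(dy) - 0.5 * h * h * dt * len(dy))

def posterior(levels, prior, dy, dt):
    """Posterior over the candidate levels using the exponential weights."""
    w = [p * unnormalized_weight(h, dy, dt) for h, p in zip(levels, prior)]
    total = sum(w)
    return [v / total for v in w]

def gaussian_posterior(levels, prior, dy, dt):
    """Direct Bayes rule with independent N(h dt, dt) observation increments."""
    w = []
    for h, p in zip(levels, prior):
        loglik = sum(-((d - h * dt) ** 2) / (2.0 * dt) for d in dy)
        w.append(p * math.exp(loglik))
    total = sum(w)
    return [v / total for v in w]

# Hypothetical observation increments and candidate signal levels.
dy = [0.12, 0.05, 0.20, 0.09]
levels = [0.0, 1.0]
prior = [0.5, 0.5]

print(posterior(levels, prior, dy, 0.1))
print(gaussian_posterior(levels, prior, dy, 0.1))
```

The two posteriors coincide because the Gaussian likelihoods differ from the exponential weights only by a factor that does not depend on the level and cancels in the ratio.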


3. Equations for non-linear filter with martingale noise 

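In this section we derive the Zakai equation for the so-called "unconditional"
measure-valued process. We follow the techniques developed in [2]. We assume that
$\{X_t,\ 0 \le t \le T\}$ is a solution of the martingale problem. Let $C_c^2(\mathbb{R}^n)$ be the space
of twice continuously differentiable functions with compact support. Let

\[
(L_t f)(x) = \sum_{j=1}^{n} b_j(t,x)\,\frac{\partial f}{\partial x_j}(x)
 + \frac{1}{2}\sum_{i,j=1}^{n} \sigma_{i,j}(t,x)\,\frac{\partial^2 f}{\partial x_i\,\partial x_j}(x)
\]

for $f \in C_c^2(\mathbb{R}^n)$, with $b_j(t,x)$ and $\sigma_{i,j}(t,x)$ bounded and continuous. We assume
that $X_t$ is a solution to the martingale problem, i.e., for $f \in C_c^2(\mathbb{R}^n)$,

\[
f(X_t) - \int_0^t (L_u f)(X_u)\,du
\]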
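is an $\mathcal{F}_t^X$-martingale with respect to the measure $P$. Consider the probability space
$(\Omega \times \Omega,\ \mathcal{F} \otimes \mathcal{F},\ P \otimes P')$, where $P'$ is a probability measure given by

\[
dP' = \exp\left\{ -\int_0^T h(s,X)\,dY_s + \frac{1}{2}\int_0^T h^2(s,X)\,dm(s) \right\} dP.
\]

Then under the measure $P'$, the process $Y_t$ has the same distribution as $M_t$ and is
independent of $X_t$. In addition, $P \circ X^{-1} = P' \circ X^{-1}$. This follows from Theorem
3.1 in [8]. Define

\[
\alpha_t(\omega', \omega) = \exp\left\{ \int_0^t h\left(s, X(\omega')\right) dY_s(\omega)
 - \frac{1}{2}\int_0^t h^2\left(s, X(\omega')\right) dm(s) \right\}.
\]

Then, with a notation $g(\omega') = g(T, X(\omega'))$, equation (1) can be written as

\[
E\left(g(T,X) \mid \mathcal{F}_t^Y\right)(\omega)
= \frac{\int g(\omega')\,\alpha_t(\omega',\omega)\,dP \circ X^{-1}(\omega')}
       {\int \alpha_t(\omega',\omega)\,dP \circ X^{-1}(\omega')}.
\]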

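For a function $f \in C_c^2(\mathbb{R}^n)$, denote

\[
\sigma_t(f, Y)(\omega) = \int f\left(X_t(\omega')\right) \alpha_t(\omega', \omega)\,dP \circ X^{-1}(\omega').
\]

Then we get the following analogue of the Zakai equation. We assume here that $m$
is mutually absolutely continuous with respect to the Lebesgue measure.

Theorem. The quantity $\sigma_t(f, Y)$ defined above satisfies the equation

\[
d\sigma_t\left(f(\cdot), Y\right) = \sigma_t\left(L_t f(\cdot), Y\right) dt + \sigma_t\left(h(t,\cdot) f(\cdot), Y\right) dY_t.
\]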
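Proof. We follow the argument as in [2]. Consider

\[
g_t(\omega') = f\left(X_T(\omega')\right) - \int_t^T (L_s f)\left(X_s(\omega')\right) ds,
\]

with $f \in C_c^2(\mathbb{R}^n)$. Then

\[
E_P\left(g_t \mid \mathcal{F}_t^{X'}\right) = f(X'_t), \quad 0 \le t \le T,
\]

with $X'_t$ an independent copy of $X_t$ as a function of $\omega'$. We can represent $\sigma_t(f, Y)$ as

\[
\sigma_t(f, Y) = E_P\left(g_t(\omega')\,\alpha_t(\omega', \omega)\right) = \sigma'_t(g_t, Y).
\]

By definition of $g_t$,

\[
dg_t = (L_t f)(X'_t)\,dt, \qquad d\alpha_t = \alpha_t\,h(t, X')\,dY_t.
\]

Using Itô's formula,

\[
d(g_t \alpha_t) = \alpha_t\,(L_t f)(X'_t)\,dt + g_t\,\alpha_t\,h(t, X')\,dY_t.
\]

Since $\sigma'_t(g_t, Y) = E_P(g_t \alpha_t)$, utilizing the Fubini theorem and Theorem 5.14 in [7],
we rewrite the latter as

\[
d\sigma'_t(g_t, Y) = \sigma_t(L_t f, Y)\,dt + \sigma_t\left(h(t,\cdot) f(\cdot), Y\right) dY_t.
\]

It should be noted that the application of Theorem 5.14 above is valid due to the fact that
the martingale $M_t$ is a time changed Brownian motion with non-singular time. □

Now we note that the optimal filter is given by

\[
\Pi_t(f) = \frac{\sigma_t(f, Y)}{\sigma_t(1, Y)}.
\]

Under our construction, $Y_t$ is a continuous Gaussian martingale with the increasing
process $m(t)$. Using Itô's formula we obtain

\[
d\Pi_t(f) = \Pi_t(L_t f)\,dt + \left[\Pi_t(h f) - \Pi_t(f)\,\Pi_t(h)\right] d\nu_t,
\tag{2}
\]

where $\nu_t = Y_t - \int_0^t \Pi_s(h)\,dm(s)$.

4. Filtering equations in case of fractional Brownian motion noise

Let us start with the definition of fractional Brownian motion (fBm). We say that
a Gaussian process $\{W_t^H,\ 0 \le t \le T\}$ on a filtered probability space $(\Omega, \mathcal{F}, \mathcal{F}_t, P)$,
with continuous sample paths is a fractional Brownian motion if $W_0^H = 0$, $E W_t^H = 0$,
and for $0 < H < 1$,

\[
E\,W_s^H W_t^H = \frac{1}{2}\left[ s^{2H} + t^{2H} - |s - t|^{2H} \right], \quad 0 \le s, t \le T.
\]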
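
A fractional Brownian motion with Hurst parameter $H$ is a centered Gaussian process with covariance $\frac{1}{2}\left[s^{2H} + t^{2H} - |s-t|^{2H}\right]$, so on a finite grid it can be sampled from the Cholesky factor of its covariance matrix. The sketch below is a minimal illustration of that covariance structure; the function names and the grid are assumptions, and the method is a generic Gaussian sampler rather than anything specific to this paper. The grid must consist of strictly positive, distinct times, since the variance at time 0 vanishes.

```python
import math
import random

def fbm_cov(s, t, hurst):
    """Covariance of fractional Brownian motion at times s and t."""
    return 0.5 * (s ** (2 * hurst) + t ** (2 * hurst) - abs(s - t) ** (2 * hurst))

def cholesky(a):
    """Lower triangular factor l with l l^T = a, for a positive definite matrix a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(acc) if i == j else acc / l[j][j]
    return l

def sample_fbm(times, hurst, rng):
    """One sample of (W_t^H) on a grid of strictly positive, distinct times."""
    cov = [[fbm_cov(s, t, hurst) for t in times] for s in times]
    l = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in times]
    return [sum(l[i][k] * z[k] for k in range(i + 1)) for i in range(len(times))]

times = [0.1 * (i + 1) for i in range(10)]
path = sample_fbm(times, 0.7, random.Random(1))
print(path)
print(fbm_cov(0.3, 0.7, 0.5))  # for H = 1/2 the covariance is min(s, t)
```

For $H = 1/2$ the covariance reduces to $\min(s, t)$, which is the Brownian case.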
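Let us set up some notation following [5].

\[
k_H(t, s) = \kappa_H^{-1}\, s^{1/2 - H} (t - s)^{1/2 - H},
\quad \text{where } \kappa_H = 2H\,\Gamma(3/2 - H)\,\Gamma(H + 1/2),
\]

\[
w_t^H = \lambda_H^{-1}\, t^{2 - 2H},
\quad \text{with } \lambda_H = \frac{2H\,\Gamma(3 - 2H)\,\Gamma(H + 1/2)}{\Gamma(3/2 - H)},
\]

\[
M_t^H = \int_0^t k_H(t, s)\,dW_s^H.
\]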
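
The constants in this notation are easy to evaluate, and doing so gives a quick sanity check. The sketch below (function names are illustrative assumptions) codes $\kappa_H$, $k_H$, $\lambda_H$ and $w_t^H$ as written above and checks two things: for $H = 1/2$ the kernel is identically 1 and $w_t^H = t$, and for general $H$ the Beta integral $\int_0^t k_H(t,s)\,ds = \kappa_H^{-1}\, t^{2-2H}\, \Gamma(3/2-H)^2 / \Gamma(3-2H)$ coincides with $w_t^H$.

```python
import math

def kappa(hurst):
    """Normalizing constant kappa_H of the kernel."""
    return 2.0 * hurst * math.gamma(1.5 - hurst) * math.gamma(hurst + 0.5)

def kernel(t, s, hurst):
    """k_H(t, s) for 0 < s < t."""
    return (s * (t - s)) ** (0.5 - hurst) / kappa(hurst)

def lam(hurst):
    """Constant lambda_H."""
    return (2.0 * hurst * math.gamma(3.0 - 2.0 * hurst) * math.gamma(hurst + 0.5)
            / math.gamma(1.5 - hurst))

def w(t, hurst):
    """w_t^H = t^(2 - 2H) / lambda_H."""
    return t ** (2.0 - 2.0 * hurst) / lam(hurst)

def kernel_integral(t, hurst):
    """Closed form of the integral of k_H(t, s) over 0 < s < t (a Beta integral)."""
    beta = math.gamma(1.5 - hurst) ** 2 / math.gamma(3.0 - 2.0 * hurst)
    return t ** (2.0 - 2.0 * hurst) * beta / kappa(hurst)

print(kernel(0.7, 0.2, 0.5), w(0.7, 0.5))       # Brownian case
print(kernel_integral(0.7, 0.7), w(0.7, 0.7))   # the two values agree
```

The Brownian case shows that for $H = 1/2$ the process $M_t^H$ is $W_t^H$ itself.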


The integral with respect to fBm is described in [9]. The process $M_t^H$ is a
Gaussian martingale. Define

\[
Q_C^H(t) = \frac{d}{dw_t^H} \int_0^t k_H(t, s)\,C(s)\,ds,
\]

where $C(t)$ is an $\mathcal{F}_t$-adapted process and the derivative is understood to be in terms
of absolute continuity. Then the following result can be derived from [5].

Let $Y_t = \int_0^t C(s, X)\,ds + W_t^H$. Then

\[
Z_t = \int_0^t Q_C^H(s)\,dw_s^H + M_t^H
\]

is an $\mathcal{F}_t^Y$ semi-martingale and $\mathcal{F}_t^Y = \mathcal{F}_t^Z$. Let us now consider the filtering problem
as in Section 2, with the noise $W_t^H$ and the observation process

\[
Y_t = \int_0^t C(s, X)\,ds + W_t^H.
\tag{3}
\]

Then the equivalent filtering problem is given by the system process $X_t$ and the
observation process

\[
Z_t = \int_0^t Q_C^H(s, X)\,dw_s^H + M_t^H.
\]

Using results of Section 3, and assuming that $X_t$ is a solution to the martingale
problem, equation (2) reduces to

\[
d\Pi_t(f) = \Pi_t(L_t f)\,dt + \left[\Pi_t(Q_C^H f) - \Pi_t(f)\,\Pi_t(Q_C^H)\right] d\nu_t,
\]

where $\nu_t = Z_t - \int_0^t \Pi_s(Q_C^H)\,dw_s^H$. By Theorem 2 in [5] we get that $\nu_t$ is a continuous
Gaussian $\mathcal{F}_t^Y$-martingale with variance $w_t^H$.

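Let us now assume that the system process and observation processes are given
by

\[
X_t = \int_0^t b(u)\,X_u\,du + \int_0^t \sigma(u)\,dW_u,
\]

\[
Y_t = \int_0^t c(u)\,X_u\,du + W_t^H,
\]

where the processes $W_t$ and $W_t^H$ are independent. Because $(X_t, Z_t)$ is jointly
Gaussian we get

\[
\begin{aligned}
\Pi_t(X_t X_s) - \Pi_t(X_t)\,\Pi_t(X_s)
 &= E\left\{ \left(X_t - \Pi_t(X_t)\right)\left(X_s - \Pi_t(X_s)\right) \mid \mathcal{F}_t^Y \right\} \\
 &= E\left\{ \left(X_t - \Pi_t(X_t)\right)\left(X_s - \Pi_t(X_s)\right) \right\} \\
 &= \Gamma(t, s).
\end{aligned}
\]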
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VVe obtain that 

dnt(Xt) = b(t)h,(Xt)dt + f kHit,s)r{t,s)dsdu,. (4) 

Jo 

Denote by 7(f) = EXf, and F{t) = Then by the Ito formula for 

f{x) = and by taking the expectation, we get 

d'y{t.) = 2b{t)'){t)dt + a^{t)dt 

and dF{t) = 2b{t)F{t)dt kn{t,s)Y{t,s) d^ dwj^ . 

Let us consider 

r{t,t) = E{Xt-ti{Xt)f 

= E{Xf) - EitlUXt)) 

= 7(0 - F{t). 

Then we arrive at 

dr{t,t) = 2b{t)r{t,t)dt + a^{t)dt — kn{t,s)r{t,s) ds^ dw^ . (5) 

For H = ^ this reduces to the Kalman equation. 

Equations (4) and (5) give the Kalman filtering equations in the linear case. 
Abstract: Self-similar stochastic processes are used for stochastic modeling
whenever it is expected that long range dependence may be present in the
phenomenon under consideration. After discussing some basic concepts of
self-similar processes and fractional Brownian motion, we review some recent work
on parametric and nonparametric inference for estimation of parameters for
linear systems of stochastic differential equations driven by a fractional
Brownian motion.


1. Introduction 

“Asymptotic Distributions in Some Nonregular Statistical Problems” was the topic
of my Ph.D. Dissertation prepared under the guidance of Prof. Herman Rubin at
Michigan State University in 1966. One of the nonregular problems studied in the
dissertation was the problem of estimation of the location of cusp of a continuous
density. The approach adopted was to study the limiting distribution, if any, of
the log-likelihood ratio process and then obtain the asymptotic properties of the
maximum likelihood estimator. It turned out that the limiting process is a special
type of a nonstationary Gaussian process. The name fractional Brownian motion was
not in vogue in those years and the limiting process is nothing but a functional shift
of a fractional Brownian motion. Details of these results are given in Prakasa Rao
(1966) and Prakasa Rao (1968). The other nonregular problems discussed in the
dissertation dealt with inference under order restrictions wherein it was shown that,
for the existence of the limiting distribution, if any, for the nonparametric maximum
likelihood density estimators under order restrictions such as unimodality of the
density function or monotonicity of the failure rate function, one needs to scale the
estimator by the cube root of n, the sample size, rather than the square root of n
as in the classical parametric inference (cf. Prakasa Rao (1969, 1970)). This type
of asymptotics is presently known as cube root asymptotics in the literature. It
gives me great pleasure to contribute this paper to the festschrift in honour of
my “guruvu” Prof. Herman Rubin.

A short review of some properties of self-similar processes is given in Section 2.
Stochastic differential equations driven by a fractional Brownian motion
(fBm) are introduced in Section 3. Asymptotic properties of the maximum
likelihood estimators and the Bayes estimators for parameters involved in linear
stochastic differential equations driven by a fBm with a known Hurst index are
reviewed in Section 4. Methods for statistical inference such as the maximum
likelihood estimation and the sequential maximum likelihood estimation are
discussed for the special case of the fractional Ornstein-Uhlenbeck type process
and some new results on the method of minimum $L_1$-norm estimation are presented
in Section 5. Identification or nonparametric estimation of the “drift”
function for linear stochastic systems driven by a fBm is studied in Section 6.

Indian Statistical Institute, 7, S. J. S. Sansanwal Marg, New Delhi, 110016. e-mail:
blsp@isid.ac.in

Keywords and phrases: self-similar process, fractional Brownian motion, fractional
Ornstein-Uhlenbeck type process, Girsanov-type theorem, maximum likelihood estimation,
Bayes estimation

2. Self-similar processes 

Long range dependence phenomenon is said to occur in a stationary time series
$\{X_n, n \ge 0\}$ if the covariance $\mathrm{Cov}(X_0, X_n)$ of the time series tends to zero
as $n \to \infty$ and yet the condition

$$\sum_{n=0}^{\infty} |\mathrm{Cov}(X_0, X_n)| = \infty \tag{2.1}$$

holds. In other words the covariance between $X_0$ and $X_n$ tends to zero but so
slowly that their sums diverge. This phenomenon was first observed by the hydrologist
Hurst (1951) on projects involving the design of reservoirs along the Nile river (cf.
Montanari (2003)) and by others in hydrological time series. It was recently observed
that a similar phenomenon occurs in problems connected with traffic patterns of
packet flows in high speed data networks such as the internet (cf. Willinger et al.
(2003) and Norros (2003)). Long range dependence is also related to the concept of
self-similarity for a stochastic process in that the increments of a self-similar process
with stationary increments exhibit long range dependence. Long range dependence
pattern is also observed in macroeconomics and finance (cf. Henry and Zaffaroni
(2003)). A recent monograph by Doukhan et al. (2003) discusses the theory and
applications of long range dependence.
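
Condition (2.1) can be illustrated numerically. The sketch below is only an illustration and not part of the original development: it assumes the stationary series is fractional Gaussian noise, that is, the unit-step increments of a standard fractional Brownian motion, whose lag-$n$ autocovariance is $\frac{1}{2}(|n+1|^{2H} - 2|n|^{2H} + |n-1|^{2H})$. The function names `fgn_autocov` and `partial_abs_sum` are hypothetical.

```python
import numpy as np


def fgn_autocov(lag, hurst):
    """Autocovariance of unit-variance fractional Gaussian noise at the given lag(s)."""
    n = np.abs(np.asarray(lag, dtype=float))
    p = 2.0 * hurst
    return 0.5 * ((n + 1.0) ** p - 2.0 * n ** p + np.abs(n - 1.0) ** p)


def partial_abs_sum(n_max, hurst):
    """Partial sum of |Cov(X_0, X_n)| for n = 0, ..., n_max, as in (2.1)."""
    lags = np.arange(n_max + 1)
    return float(np.sum(np.abs(fgn_autocov(lags, hurst))))


if __name__ == "__main__":
    # For H = 0.8 the partial sums keep growing; for H = 0.5 they stay at 1.
    for n_max in (10**2, 10**4, 10**6):
        print(n_max, partial_abs_sum(n_max, 0.8), partial_abs_sum(n_max, 0.5))
```

Under this assumption the individual covariances decay to zero for $H = 0.8$ while their partial sums grow without bound, whereas for $H = 1/2$ the increments are uncorrelated and the sum stays finite.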

A real-valued stochastic process $Z = \{Z(t), -\infty < t < \infty\}$ is said to be
self-similar with index $H > 0$ if for any $a > 0$,

$$\mathcal{L}(\{Z(at), -\infty < t < \infty\}) = \mathcal{L}(\{a^H Z(t), -\infty < t < \infty\}) \tag{2.2}$$

where $\mathcal{L}$ denotes the class of all finite dimensional distributions and the equality
indicates the equality of the finite dimensional distributions of the process on the
right side of the equation (2.2) with the corresponding finite dimensional
distributions of the process on the left side of the equation (2.2). The index $H$ is called
the scaling exponent or the fractal index or the Hurst parameter of the process. If
$H$ is the scaling exponent of a self-similar process $Z$, then the process $Z$ is called
$H$-self similar process or $H$-ss process for short. It can be checked that a
nondegenerate $H$-ss process cannot be a stationary process. In fact if $\{Z(t), t > 0\}$ is a
$H$-ss process, then the process

$$Y(t) = e^{-tH} Z(e^t), \quad -\infty < t < \infty \tag{2.3}$$

is a stationary process. Conversely if $Y = \{Y(t), -\infty < t < \infty\}$ is a stationary
process, then $Z = \{t^H Y(\log t), t > 0\}$ is a $H$-ss process.

Suppose $Z = \{Z(t), -\infty < t < \infty\}$ is a $H$-ss process with finite variance and
stationary increments, that is,

$$\mathcal{L}(Z(t+h) - Z(t)) = \mathcal{L}(Z(h) - Z(0)), \quad -\infty < t, h < \infty. \tag{2.4}$$

Then the following properties hold:

(i) $Z(0) = 0$ a.s.;

(ii) If $H \ne 1$, then $E(Z(t)) = 0$, $-\infty < t < \infty$;

(iii) $\mathcal{L}(Z(-t)) = \mathcal{L}(-Z(t))$;

(iv) $E(Z^2(t)) = |t|^{2H} E(Z^2(1))$;

(v) The covariance function $\Gamma_H(t,s)$ of the process $Z$ is given by

$$\Gamma_H(t,s) = \frac{1}{2}\{|t|^{2H} + |s|^{2H} - |t-s|^{2H}\}\,E(Z^2(1)). \tag{2.5}$$

(vi) The self-similarity parameter, also called the scaling exponent or fractal index
$H$, is less than or equal to one.

(vii) If $H = 1$, then $Z(t) = tZ(1)$ a.s. for $-\infty < t < \infty$.

(viii) Let $0 < H \le 1$. Then the function

$$R_H(s,t) = |s|^{2H} + |t|^{2H} - |s-t|^{2H} \tag{2.6}$$

is nonnegative definite. For proofs of the above properties, see Taqqu (2003).
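
Properties (iv), (v) and (viii) lend themselves to a quick numerical check. The following is a minimal sketch, not taken from the references: it evaluates the covariance (2.5) on a finite grid (the grid and the helper names `fbm_cov` and `cov_matrix` are illustrative assumptions) and inspects the smallest eigenvalue of the resulting matrix, which should be nonnegative up to rounding error.

```python
import numpy as np


def fbm_cov(t, s, hurst, var1=1.0):
    """Covariance (2.5) of an H-ss process with stationary increments; var1 = E[Z(1)^2]."""
    t = np.asarray(t, dtype=float)
    s = np.asarray(s, dtype=float)
    p = 2.0 * hurst
    return 0.5 * var1 * (np.abs(t) ** p + np.abs(s) ** p - np.abs(t - s) ** p)


def cov_matrix(times, hurst):
    """Matrix [Gamma_H(t_i, t_j)] on a grid of time points."""
    times = np.asarray(times, dtype=float)
    return fbm_cov(times[:, None], times[None, :], hurst)


if __name__ == "__main__":
    grid = np.linspace(-1.0, 2.0, 31)
    for h in (0.2, 0.5, 0.9):
        # Smallest eigenvalue: nonnegative up to floating-point rounding.
        print(h, np.linalg.eigvalsh(cov_matrix(grid, h)).min())
```

For $H = 1/2$ and positive times the matrix reduces to $\min(s,t)$, the covariance of the Wiener process.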

A Gaussian $H$-ss process $W^H = \{W^H(t), -\infty < t < \infty\}$ with stationary
increments and with fractal index $0 < H < 1$ is called a fractional Brownian motion
(fBm). It is said to be standard if $\mathrm{Var}(W^H(1)) = 1$. For any $0 < H < 1$, there exists
a version of the fBm for which the sample paths are continuous with probability
one but are not differentiable even in the $L^2$-sense. The continuity of the sample
paths follows from Kolmogorov's continuity condition and the fact that

$$E|W^H(t_2) - W^H(t_1)|^{\alpha} = E|W^H(1)|^{\alpha}\,|t_2 - t_1|^{\alpha H} \tag{2.7}$$

from the property that the fBm is a $H$-ss process with stationary increments. We
can choose $\alpha$ such that $\alpha H > 1$ to satisfy Kolmogorov's continuity condition.
Furthermore

$$E\left|\frac{W^H(t_2) - W^H(t_1)}{t_2 - t_1}\right|^2 = E[W^H(1)^2]\,|t_2 - t_1|^{2H-2} \tag{2.8}$$

and the last term tends to infinity as $t_2 \to t_1$ since $H < 1$. Hence the paths of the
fBm are not $L^2$-differentiable. It is interesting to note that the fractional Brownian
motion reduces to the Brownian motion or the Wiener process for the case when
$H = \frac{1}{2}$.
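
The scaling relation (2.7) with $\alpha = 2$ can be checked by simulation. The sketch below is one possible illustration, assuming a standard fBm sampled on a finite grid of strictly positive, distinct time points and using a Cholesky factor of the covariance matrix; this simulation method and the function name `simulate_fbm` are choices made here, not something prescribed in the text.

```python
import numpy as np


def simulate_fbm(times, hurst, n_paths, rng):
    """Draw standard fBm paths on strictly positive, distinct time points via Cholesky."""
    times = np.asarray(times, dtype=float)
    s, t = times[:, None], times[None, :]
    p = 2.0 * hurst
    cov = 0.5 * (s ** p + t ** p - np.abs(s - t) ** p)
    chol = np.linalg.cholesky(cov)          # cov = chol @ chol.T
    noise = rng.standard_normal((len(times), n_paths))
    return (chol @ noise).T                 # shape (n_paths, len(times))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    times = np.linspace(0.05, 1.0, 20)
    paths = simulate_fbm(times, 0.7, 4000, rng)
    lag = times[-1] - times[9]
    # Empirical variance of an increment versus |t2 - t1|^(2H) from (2.7).
    print(np.var(paths[:, -1] - paths[:, 9]), lag ** 1.4)
```

The cost of the factorization grows cubically with the number of grid points, so this approach is only meant for small illustrative grids.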

As was mentioned above, self-similar processes have been used for stochastic
modeling in such diverse areas as hydrology, geophysics, medicine, genetics and
financial economics and more recently in modeling internet traffic patterns. Recent
additional applications are given in Buldyrev et al. (1993), Ossadnik et al. (1994),
Percival and Guttorp (1994) and Peng et al. (1992, 1995a, b). It is important to
estimate the constant $H$ for modeling purposes. This problem has been considered
by Azais (1990), Geweke and Porter-Hudak (1983), Taylor and Taylor (1991), Beran
and Terrin (1994), Constantine and Hall (1994), Feuerverger et al. (1994), Chen et
al. (1995), Robinson (1995), Abry and Sellan (1996), Comte (1996), McCoy and
Walden (1996), Hall et al. (1997), Kent and Wood (1997), and more recently in
Jensen (1998), Poggi and Viano (1998) and Coeurjolly (2001).

It was observed that there are some phenomena which exhibit self-similar
behaviour locally but the nature of self-similarity changes as the phenomenon evolves.
It was suggested that the parameter $H$ must be allowed to vary as a function of time
for modeling such data. Goncalves and Flandrin (1993) and Flandrin and Goncalves
(1994) propose a class of processes which are called locally self-similar with
dependent scaling exponents and discuss their applications. Wang et al. (2001) develop
procedures using wavelets to construct local estimates for the time varying scaling
exponent $H(t)$ of a locally self-similar process.

3. Stochastic differential equations driven by fBm

Let $(\Omega, \mathcal{F}, (\mathcal{F}_t), P)$ be a stochastic basis satisfying the usual conditions. The natural
filtration of a process is understood as the P-completion of the filtration generated
by this process.

Let $W^H = \{W_t^H, t \ge 0\}$ be a normalized fractional Brownian motion (fBm)
with Hurst parameter $H \in (0,1)$, that is, a Gaussian process with continuous sample
paths such that $W_0^H = 0$, $E(W_t^H) = 0$ and

$$E(W_s^H W_t^H) = \frac{1}{2}\left[s^{2H} + t^{2H} - |s-t|^{2H}\right], \quad t \ge 0,\ s \ge 0. \tag{3.1}$$

Let us consider a stochastic process $Y = \{Y_t, t \ge 0\}$ defined by the stochastic
integral equation

$$Y_t = \int_0^t C(s)\,ds + \int_0^t B(s)\,dW_s^H, \quad t \ge 0 \tag{3.2}$$

where $C = \{C(t), t \ge 0\}$ is an $(\mathcal{F}_t)$-adapted process and $B(t)$ is a nonvanishing
nonrandom function. For convenience, we write the above integral equation in the
form of a stochastic differential equation

$$dY_t = C(t)\,dt + B(t)\,dW_t^H, \quad t \ge 0 \tag{3.3}$$

driven by the fractional Brownian motion $W^H$. The integral

$$\int_0^t B(s)\,dW_s^H \tag{3.4}$$

is not a stochastic integral in the Ito sense but one can define the integral of a
deterministic function with respect to the fBm in a natural sense (cf. Gripenberg
and Norros (1996); Norros et al. (1999)). Even though the process $Y$ is not a
semimartingale, one can associate a semimartingale $Z = \{Z_t, t \ge 0\}$, called
a fundamental semimartingale, such that the natural filtration $(\mathcal{Z}_t)$ of the process $Z$
coincides with the natural filtration $(\mathcal{Y}_t)$ of the process $Y$ (Kleptsyna et al. (2000)).
Define, for $0 < s < t$,

$$k_H = 2H\,\Gamma\left(\tfrac{3}{2} - H\right)\Gamma\left(H + \tfrac{1}{2}\right), \tag{3.5}$$

$$\kappa_H(t,s) = k_H^{-1}\, s^{\frac{1}{2}-H}(t-s)^{\frac{1}{2}-H}, \tag{3.6}$$

$$\lambda_H = \frac{2H\,\Gamma(3-2H)\,\Gamma(H+\frac{1}{2})}{\Gamma(\frac{3}{2}-H)}, \tag{3.7}$$

$$w_t^H = \lambda_H^{-1}\, t^{2-2H}, \tag{3.8}$$

and

$$M_t^H = \int_0^t \kappa_H(t,s)\,dW_s^H, \quad t \ge 0. \tag{3.9}$$

The process $M^H$ is a Gaussian martingale, called the fundamental martingale (cf.
Norros et al. (1999)) and its quadratic variation $\langle M^H \rangle_t = w_t^H$. Furthermore the
natural filtration of the martingale $M^H$ coincides with the natural filtration of the
fBm $W^H$. In fact the stochastic integral

$$\int_0^t B(s)\,dW_s^H \tag{3.10}$$

can be represented in terms of the stochastic integral with respect to the martingale
$M^H$. For a measurable function $f$ on $[0,T]$, let

$$K_H^f(t,s) = -2H\,\frac{d}{ds}\int_s^t f(r)\, r^{H-\frac{1}{2}}(r-s)^{H-\frac{1}{2}}\,dr, \quad 0 \le s \le t \tag{3.11}$$

when the derivative exists in the sense of absolute continuity with respect to the
Lebesgue measure (see Samko et al. (1993) for sufficient conditions). The following
result is due to Kleptsyna et al. (2000).

Theorem 3.1. Let $M^H$ be the fundamental martingale associated with the fBm
$W^H$ defined by (3.9). Then

$$\int_0^t f(s)\,dW_s^H = \int_0^t K_H^f(t,s)\,dM_s^H, \quad t \in [0,T] \tag{3.12}$$

a.s. [P] whenever both sides are well defined.
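
The constants and deterministic functions in (3.5) to (3.8) are straightforward to evaluate. The following sketch is only a convenience for the reader, with hypothetical function names (`k_const`, `lambda_const`, `kappa`, `w`); it relies on the standard library gamma function and checks the sanity case $H = 1/2$, where one expects $k_H = \lambda_H = 1$, $\kappa_H \equiv 1$ and $w_t^H = t$, so that $M^H$ coincides with the Wiener process.

```python
from math import gamma, isclose


def k_const(h):
    """k_H from (3.5)."""
    return 2.0 * h * gamma(1.5 - h) * gamma(h + 0.5)


def lambda_const(h):
    """lambda_H from (3.7)."""
    return 2.0 * h * gamma(3.0 - 2.0 * h) * gamma(h + 0.5) / gamma(1.5 - h)


def kappa(t, s, h):
    """Kernel kappa_H(t, s) from (3.6), for 0 < s < t."""
    return (s ** (0.5 - h)) * ((t - s) ** (0.5 - h)) / k_const(h)


def w(t, h):
    """Variance function w_t^H from (3.8)."""
    return t ** (2.0 - 2.0 * h) / lambda_const(h)


if __name__ == "__main__":
    print(isclose(k_const(0.5), 1.0), isclose(lambda_const(0.5), 1.0))
    print(kappa(2.0, 0.5, 0.5), w(3.0, 0.5))
    print(k_const(0.75), lambda_const(0.75), w(1.0, 0.75))
```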

Suppose the sample paths of the process $\{C(t)/B(t), t \ge 0\}$ are smooth enough (see
Samko et al. (1993)) so that

$$Q_H(t) = \frac{d}{dw_t^H}\int_0^t \kappa_H(t,s)\,\frac{C(s)}{B(s)}\,ds, \quad t \in [0,T] \tag{3.13}$$

is well-defined where $w^H$ and $\kappa_H$ are as defined in (3.8) and (3.6) respectively and
the derivative is understood in the sense of absolute continuity. The following
theorem due to Kleptsyna et al. (2000) associates a fundamental semimartingale $Z$
with the process $Y$ such that the natural filtration $(\mathcal{Z}_t)$ coincides with
the natural filtration $(\mathcal{Y}_t)$ of $Y$.

Theorem 3.2. Suppose the sample paths of the process $Q_H$ defined by (3.13) belong
P-a.s. to $L^2([0,T], dw^H)$ where $w^H$ is as defined by (3.8). Let the process $Z =
(Z_t, t \in [0,T])$ be defined by

$$Z_t = \int_0^t \kappa_H(t,s)\,B^{-1}(s)\,dY_s \tag{3.14}$$

where the function $\kappa_H(t,s)$ is as defined in (3.6). Then the following results hold:

(i) the process $Z$ is an $(\mathcal{F}_t)$-semimartingale with the decomposition

$$Z_t = \int_0^t Q_H(s)\,dw_s^H + M_t^H \tag{3.15}$$

where $M^H$ is the fundamental martingale defined by (3.9),

(ii) the process $Y$ admits the representation

$$Y_t = \int_0^t K_H^B(t,s)\,dZ_s \tag{3.16}$$

where the function $K_H^B$ is as defined in (3.11), and

(iii) the natural filtrations $(\mathcal{Z}_t)$ and $(\mathcal{Y}_t)$ coincide.

Kleptsyna et al. (2000) derived the following Girsanov type formula as a
consequence of the Theorem 3.2.

Theorem 3.3. Suppose the assumptions of Theorem 3.2 hold. Define

$$\Lambda_H(T) = \exp\left\{-\int_0^T Q_H(t)\,dM_t^H - \frac{1}{2}\int_0^T Q_H^2(t)\,dw_t^H\right\}. \tag{3.17}$$

Suppose that $E(\Lambda_H(T)) = 1$. Then the measure $P^* = \Lambda_H(T)P$ is a probability
measure and the probability measure of the process $Y$ under $P^*$ is the same as that
of the process $V$ defined by

$$V_t = \int_0^t B(s)\,dW_s^H, \quad 0 \le t \le T. \tag{3.18}$$


4. Statistical inference for linear SDE driven by fBm 

Statistical inference for diffusion type processes satisfying stochastic differential
equations driven by Wiener processes has been studied earlier and a comprehensive
survey of various methods is given in Prakasa Rao (1999a, b). There has been
a recent interest to study similar problems for stochastic processes driven by a
fractional Brownian motion for modeling stochastic phenomena with possible long
range dependence. Le Breton (1998) studied parameter estimation and filtering
in a simple linear model driven by a fractional Brownian motion. In a recent paper,
Kleptsyna and Le Breton (2002) studied parameter estimation problems for
fractional Ornstein-Uhlenbeck type process. This is a fractional analogue of the
Ornstein-Uhlenbeck process, that is, a continuous time first order autoregressive
process $X = \{X_t, t \ge 0\}$ which is the solution of a one-dimensional homogeneous
linear stochastic differential equation driven by a fractional Brownian motion (fBm)
$W^H = \{W_t^H, t \ge 0\}$ with Hurst parameter $H \in (1/2, 1)$. Such a process is the unique
Gaussian process satisfying the linear integral equation

$$X_t = \theta\int_0^t X_s\,ds + \sigma W_t^H, \quad t \ge 0. \tag{4.1}$$

They investigate the problem of estimation of the parameters $\theta$ and $\sigma^2$ based on the
observation $\{X_s, 0 \le s \le T\}$ and prove that the maximum likelihood estimator $\hat\theta_T$
is strongly consistent as $T \to \infty$.

We now discuss more general classes of stochastic processes satisfying linear
stochastic differential equations driven by a fractional Brownian motion and review
some recent work connected with the asymptotic properties of the maximum
likelihood and the Bayes estimators for parameters involved in such processes. We will
also discuss some aspects of sequential estimation and minimum distance estimation
problems for fractional Ornstein-Uhlenbeck type processes in the next section.

Let us consider the stochastic differential equation

$$dX(t) = [a(t, X(t)) + \theta\, b(t, X(t))]\,dt + \sigma(t)\,dW_t^H, \quad X(0) = 0,\ t \ge 0 \tag{4.2}$$

where $\theta \in \Theta \subset R$, $W = \{W_t^H, t \ge 0\}$ is a fractional Brownian motion with known
Hurst parameter $H$ and $\sigma(t)$ is a positive nonvanishing function on $[0, \infty)$. In other
words $X = \{X_t, t \ge 0\}$ is a stochastic process satisfying the stochastic integral
equation

$$X(t) = \int_0^t [a(s, X(s)) + \theta\, b(s, X(s))]\,ds + \int_0^t \sigma(s)\,dW_s^H, \quad X(0) = 0,\ t \ge 0. \tag{4.3}$$

Let

$$C(\theta, t) = a(t, X(t)) + \theta\, b(t, X(t)), \quad t \ge 0 \tag{4.4}$$

and assume that the sample paths of the process $\{C(\theta,t)/\sigma(t), t \ge 0\}$ are smooth enough
so that the process

$$Q_{H,\theta}(t) = \frac{d}{dw_t^H}\int_0^t \kappa_H(t,s)\,\frac{C(\theta,s)}{\sigma(s)}\,ds, \quad t \ge 0 \tag{4.5}$$

is well-defined where $w_t^H$ and $\kappa_H(t,s)$ are as defined in (3.8) and (3.6) respectively.
Suppose the sample paths of the process $\{Q_{H,\theta}(t), 0 \le t \le T\}$ belong almost surely
to $L^2([0,T], dw_t^H)$. Define

$$Z_t = \int_0^t \frac{\kappa_H(t,s)}{\sigma(s)}\,dX_s, \quad t \ge 0. \tag{4.6}$$

Then the process $Z = \{Z_t, t \ge 0\}$ is an $(\mathcal{F}_t)$-semimartingale with the decomposition

$$Z_t = \int_0^t Q_{H,\theta}(s)\,dw_s^H + M_t^H \tag{4.7}$$

where $M^H$ is the fundamental martingale defined by (3.9) and the process $X$ admits
the representation

$$X_t = \int_0^t K_H^{\sigma}(t,s)\,dZ_s \tag{4.8}$$

where the function $K_H^{\sigma}(\cdot,\cdot)$ is as defined by (3.11). Let $P_\theta^T$ be the measure induced
by the process $\{X_t, 0 \le t \le T\}$ when $\theta$ is the true parameter. Following Theorem
3.3, we get that the Radon-Nikodym derivative of $P_\theta^T$ with respect to $P_0^T$ is given
by

$$\frac{dP_\theta^T}{dP_0^T} = \exp\left[\int_0^T Q_{H,\theta}(s)\,dZ_s - \frac{1}{2}\int_0^T Q_{H,\theta}^2(s)\,dw_s^H\right]. \tag{4.9}$$

Maximum likelihood estimation

We now consider the problem of estimation of the parameter $\theta$ based on the
observation of the process $X = \{X_t, 0 \le t \le T\}$ and study its asymptotic properties
as $T \to \infty$.

Strong consistency:

Let $L_T(\theta)$ denote the Radon-Nikodym derivative $dP_\theta^T/dP_0^T$. The maximum likelihood
estimator (MLE) $\hat\theta_T$ is defined by the relation

$$L_T(\hat\theta_T) = \sup_{\theta \in \Theta} L_T(\theta). \tag{4.10}$$

We assume that there exists such a measurable maximum likelihood estimator.
Sufficient conditions can be given for the existence of such an estimator (cf. Lemma
3.1.2, Prakasa Rao (1987)).

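Note that

$$Q_{H,\theta}(t) = \frac{d}{dw_t^H}\int_0^t \kappa_H(t,s)\,\frac{a(s,X(s))}{\sigma(s)}\,ds + \theta\,\frac{d}{dw_t^H}\int_0^t \kappa_H(t,s)\,\frac{b(s,X(s))}{\sigma(s)}\,ds = J_1(t) + \theta J_2(t) \text{ (say)}. \tag{4.11}$$

Then

$$\log L_T(\theta) = \int_0^T (J_1(t) + \theta J_2(t))\,dZ_t - \frac{1}{2}\int_0^T (J_1(t) + \theta J_2(t))^2\,dw_t^H \tag{4.12}$$

and the likelihood equation is given by

$$\int_0^T J_2(t)\,dZ_t - \int_0^T (J_1(t) + \theta J_2(t))\,J_2(t)\,dw_t^H = 0. \tag{4.13}$$

Hence the MLE $\hat\theta_T$ of $\theta$ is given by

$$\hat\theta_T = \frac{\int_0^T J_2(t)\,dZ_t - \int_0^T J_1(t)\,J_2(t)\,dw_t^H}{\int_0^T J_2^2(t)\,dw_t^H}. \tag{4.14}$$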
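
A discretized version of the estimator obtained by solving the likelihood equation (4.13) for $\theta$ can be sketched as follows. This is a hedged illustration rather than the procedure of any cited paper: it assumes that values of $J_1$, $J_2$ and increments of $Z$ and $w^H$ are already available on a grid, replaces the integrals by Riemann sums, and tests the formula on synthetic data in which $J_1$ and $J_2$ are deterministic and the martingale increments are independent $N(0, dw)$ variables. The names `mle_theta` and `synthetic_increments` are hypothetical.

```python
import numpy as np


def mle_theta(j1, j2, dz, dw):
    """Riemann-sum solution of the likelihood equation (4.13) for theta."""
    j1, j2, dz, dw = (np.asarray(a, dtype=float) for a in (j1, j2, dz, dw))
    numerator = np.sum(j2 * dz) - np.sum(j1 * j2 * dw)
    information = np.sum(j2 ** 2 * dw)
    return numerator / information


def synthetic_increments(theta0, j1, j2, dw, rng):
    """dZ = (J1 + theta0 J2) dw + dM with independent N(0, dw) martingale increments."""
    dm = np.sqrt(dw) * rng.standard_normal(dw.shape)
    return (j1 + theta0 * j2) * dw + dm


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    t = np.linspace(0.0, 50.0, 5001)[:-1]
    dw = np.full_like(t, 0.01)              # w_t = t, i.e. the case H = 1/2
    j1, j2 = np.sin(t), 1.0 + 0.1 * t       # illustrative deterministic integrands
    dz = synthetic_increments(0.8, j1, j2, dw, rng)
    print(mle_theta(j1, j2, dz, dw))
```

On noise-free increments the function recovers the true parameter exactly, which is a direct check of the sign of the $J_1 J_2$ term.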

Let $\theta_0$ be the true parameter. Using the fact that

$$dZ_t = (J_1(t) + \theta_0 J_2(t))\,dw_t^H + dM_t^H, \tag{4.15}$$

it can be shown that

$$\frac{dP_\theta^T}{dP_{\theta_0}^T} = \exp\left[(\theta - \theta_0)\int_0^T J_2(t)\,dM_t^H - \frac{1}{2}(\theta - \theta_0)^2\int_0^T J_2^2(t)\,dw_t^H\right]. \tag{4.16}$$

Following this representation of the Radon-Nikodym derivative, we obtain that

$$\hat\theta_T - \theta_0 = \frac{\int_0^T J_2(t)\,dM_t^H}{\int_0^T J_2^2(t)\,dw_t^H}. \tag{4.17}$$

Note that the quadratic variation $\langle Z \rangle$ of the process $Z$ is the same as the
quadratic variation $\langle M^H \rangle$ of the martingale $M^H$ which in turn is equal to $w^H$.
This follows from the relations (3.15) and (3.9). Hence we obtain that

$$[w_T^H]^{-1}\lim_{n}\sum_i \left[Z_{t_{i+1}^{(n)}} - Z_{t_i^{(n)}}\right]^2 = 1 \text{ a.s. } [P_{\theta_0}]$$

where $(t_i^{(n)})$ is a partition of the interval $[0,T]$ such that $\sup_i |t_{i+1}^{(n)} - t_i^{(n)}|$ tends to
zero as $n \to \infty$. If the function $\sigma(t)$ is an unknown constant $\sigma$, the above property
can be used to obtain a strongly consistent estimator of $\sigma^2$ based on the continuous
observation of the process $X$ over the interval $[0,T]$. Hereafter we assume that the
nonrandom function $\sigma(t)$ is known.


We now discuss the problem of maximum likelihood estimation of the parameter
$\theta$ on the basis of the observation of the process $X$ or equivalently the process $Z$
on the interval $[0,T]$. The following result holds.

Theorem 4.1. The maximum likelihood estimator $\hat\theta_T$ is strongly consistent, that
is,

$$\hat\theta_T \to \theta_0 \text{ a.s. } [P_{\theta_0}] \text{ as } T \to \infty \tag{4.18}$$

provided

$$\int_0^T J_2^2(t)\,dw_t^H \to \infty \text{ a.s. } [P_{\theta_0}] \text{ as } T \to \infty. \tag{4.19}$$

Remark. For the case of the fractional Ornstein-Uhlenbeck type process investigated in
Kleptsyna and Le Breton (2002), it can be checked that the condition stated in
equation (4.19) holds and hence the maximum likelihood estimator $\hat\theta_T$ is strongly
consistent as $T \to \infty$.



Limiting distribution:

We now discuss the limiting distribution of the MLE $\hat\theta_T$ as $T \to \infty$.

Theorem 4.2. Assume that the functions $b(t,x)$ and $\sigma(t)$ are such that the process
$\{R_t, t \ge 0\}$ is a local continuous martingale and that there exists a norming function
$I_t, t \ge 0$ such that

$$I_T^2 \langle R_T \rangle = I_T^2 \int_0^T J_2^2(t)\,dw_t^H \xrightarrow{p} \eta^2 \text{ as } T \to \infty \tag{4.20}$$

where $I_T \to 0$ as $T \to \infty$ and $\eta$ is a random variable such that $P(\eta > 0) = 1$. Then

$$(I_T R_T, I_T^2 \langle R_T \rangle) \xrightarrow{\mathcal{L}} (\eta Z, \eta^2) \text{ as } T \to \infty \tag{4.21}$$

where the random variable $Z$ has the standard normal distribution and the random
variables $Z$ and $\eta$ are independent.

For the proofs of Theorems 4.1 and 4.2, see Prakasa Rao (2003a).

Theorem 4.3. Suppose the conditions stated in the Theorem 4.2 hold. Then

$$I_T^{-1}(\hat\theta_T - \theta_0) \xrightarrow{\mathcal{L}} \frac{Z}{\eta} \text{ as } T \to \infty \tag{4.22}$$

where the random variable $Z$ has the standard normal distribution and the random
variables $Z$ and $\eta$ are independent.

Remarks. If the random variable $\eta$ is a constant with probability one, then the
limiting distribution of the maximum likelihood estimator is normal with mean
0 and variance $\eta^{-2}$. Otherwise it is a mixture of the normal distributions with
mean zero and variance $\eta^{-2}$ with the mixing distribution as that of $\eta$. The rate of
convergence of the distribution of the maximum likelihood estimator is discussed
in Prakasa Rao (2003b).
Bayes estimation

Suppose that the parameter space $\Theta$ is open and $\Lambda$ is a prior probability measure
on the parameter space $\Theta$. Further suppose that $\Lambda$ has the density $\lambda(\cdot)$ with respect
to the Lebesgue measure and the density function is continuous and positive in an
open neighbourhood of $\theta_0$, the true parameter. Let

$$\alpha_T = I_T R_T = I_T \int_0^T J_2(t)\,dM_t^H \tag{4.23}$$

and

$$\beta_T = I_T^2 \langle R_T \rangle = I_T^2 \int_0^T J_2^2(t)\,dw_t^H. \tag{4.24}$$

We have seen earlier in (4.17) that the maximum likelihood estimator satisfies the
relation

$$\alpha_T = (\hat\theta_T - \theta_0)\,I_T^{-1}\beta_T. \tag{4.25}$$

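The posterior density of $\theta$ given the observation $X^T = \{X_s, 0 \le s \le T\}$ is given by

$$p(\theta|X^T) = \frac{\dfrac{dP_\theta^T}{dP_{\theta_0}^T}\,\lambda(\theta)}{\displaystyle\int_\Theta \dfrac{dP_\theta^T}{dP_{\theta_0}^T}\,\lambda(\theta)\,d\theta}. \tag{4.26}$$

Let us write $t = I_T^{-1}(\theta - \hat\theta_T)$ and define

$$p^*(t|X^T) = I_T\, p(\hat\theta_T + tI_T \,|\, X^T). \tag{4.27}$$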

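Then the function $p^*(t|X^T)$ is the posterior density of the transformed variable
$t = I_T^{-1}(\theta - \hat\theta_T)$. Let

$$\nu_T(t) = \frac{dP_{\hat\theta_T + tI_T}/dP_{\theta_0}}{dP_{\hat\theta_T}/dP_{\theta_0}} = \frac{dP_{\hat\theta_T + tI_T}}{dP_{\hat\theta_T}} \text{ a.s.} \tag{4.28}$$

and

$$C_T = \int_{-\infty}^{\infty} \nu_T(t)\,\lambda(\hat\theta_T + tI_T)\,dt. \tag{4.29}$$

It can be checked that

$$p^*(t|X^T) = C_T^{-1}\,\nu_T(t)\,\lambda(\hat\theta_T + tI_T) \tag{4.30}$$

and

$$\begin{aligned}
\log \nu_T(t) &= I_T^{-1}\alpha_T\left[(\hat\theta_T + tI_T - \theta_0) - (\hat\theta_T - \theta_0)\right] \\
&\quad - \frac{1}{2} I_T^{-2}\beta_T\left[(\hat\theta_T + tI_T - \theta_0)^2 - (\hat\theta_T - \theta_0)^2\right] \\
&= t\alpha_T - \frac{1}{2}t^2\beta_T - t\beta_T I_T^{-1}(\hat\theta_T - \theta_0) \\
&= -\frac{1}{2}\beta_T t^2
\end{aligned} \tag{4.31}$$

in view of the equation (4.25).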
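
The algebra leading to (4.31) can be verified numerically. The sketch below is a plain arithmetic check under stated assumptions: the log likelihood ratio is written, following (4.16), (4.23) and (4.24), as $(\theta - \theta_0) I_T^{-1}\alpha_T - \frac{1}{2}(\theta - \theta_0)^2 I_T^{-2}\beta_T$, and $\hat\theta_T$ is recovered from (4.25). The numerical values and the function names `log_likelihood_ratio` and `log_nu` are arbitrary illustrations.

```python
def log_likelihood_ratio(theta, theta0, alpha, beta, norm):
    """log dP_theta/dP_theta0 expressed through alpha_T, beta_T and the norming I_T."""
    d = theta - theta0
    return d * alpha / norm - 0.5 * d * d * beta / norm ** 2


def log_nu(t, theta0, alpha, beta, norm):
    """log nu_T(t) from (4.28), with the MLE recovered from relation (4.25)."""
    theta_hat = theta0 + norm * alpha / beta
    return (log_likelihood_ratio(theta_hat + t * norm, theta0, alpha, beta, norm)
            - log_likelihood_ratio(theta_hat, theta0, alpha, beta, norm))


if __name__ == "__main__":
    theta0, alpha, beta, norm = 1.3, 0.7, 2.5, 0.05
    for t in (-2.0, 0.5, 3.0):
        # The two printed numbers should agree: log nu_T(t) and -beta t^2 / 2.
        print(log_nu(t, theta0, alpha, beta, norm), -0.5 * beta * t * t)
```

The linear terms in $t$ cancel exactly, which is why the rescaled posterior is governed by a Gaussian kernel with precision $\beta_T$.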
Suppose that the convergence in the condition in the equation (4.20) holds
almost surely under the measure $P_{\theta_0}$ and the limit is a constant $\eta^2 > 0$ with
probability one. For convenience, we write $\beta = \eta^2$. Then

$$\beta_T \to \beta \text{ a.s. } [P_{\theta_0}] \text{ as } T \to \infty. \tag{4.32}$$

Further suppose that $K(t)$ is a nonnegative measurable function such that, for
some $0 < \varepsilon < \beta$,

$$\int_{-\infty}^{\infty} K(t)\exp\left[-\frac{1}{2}t^2(\beta - \varepsilon)\right]dt < \infty \tag{4.33}$$

and the maximum likelihood estimator $\hat\theta_T$ is strongly consistent, that is,

$$\hat\theta_T \to \theta_0 \text{ a.s. } [P_{\theta_0}] \text{ as } T \to \infty. \tag{4.34}$$

In addition, suppose that the following condition holds for every $\varepsilon > 0$ and
$\delta > 0$:

$$\exp[-\varepsilon I_T^{-2}]\int_{|u| > \delta} K(uI_T^{-1})\,\lambda(\hat\theta_T + u)\,du \to 0 \text{ a.s. } [P_{\theta_0}] \text{ as } T \to \infty. \tag{4.35}$$

Then we have the following theorem which is an analogue of the Bernstein-von
Mises theorem proved in Prakasa Rao (1981) for a class of processes satisfying a
linear stochastic differential equation driven by the standard Wiener process.

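Theorem 4.4. Let the assumptions (4.32) to (4.35) hold where $\lambda(\cdot)$ is a prior
density which is continuous and positive in an open neighbourhood of $\theta_0$, the true
parameter. Then

$$\lim_{T\to\infty}\int_{-\infty}^{\infty} K(t)\left|p^*(t|X^T) - \left(\frac{\beta}{2\pi}\right)^{1/2}\exp\left(-\frac{1}{2}\beta t^2\right)\right|dt = 0 \text{ a.s. } [P_{\theta_0}]. \tag{4.36}$$

As a consequence of the above theorem, we obtain the following result by choosing
$K(t) = |t|^m$, for some integer $m \ge 0$.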

Theorem 4.5. Assume that the following conditions hold:

(C1) $\hat\theta_T \to \theta_0$ a.s. $[P_{\theta_0}]$ as $T \to \infty$, (4.37)

(C2) $\beta_T \to \beta > 0$ a.s. $[P_{\theta_0}]$ as $T \to \infty$. (4.38)

Further suppose that

(C3) $\lambda(\cdot)$ is a prior probability density on $\Theta$ which is continuous and positive in an
open neighbourhood of $\theta_0$, the true parameter and

$$\int_{-\infty}^{\infty} |\theta|^m \lambda(\theta)\,d\theta < \infty \tag{4.39}$$

for some integer $m \ge 0$. Then

$$\lim_{T\to\infty}\int_{-\infty}^{\infty} |t|^m\left|p^*(t|X^T) - \left(\frac{\beta}{2\pi}\right)^{1/2}\exp\left(-\frac{1}{2}\beta t^2\right)\right|dt = 0 \text{ a.s. } [P_{\theta_0}]. \tag{4.40}$$

In particular, choosing $m = 0$, we obtain that

$$\lim_{T\to\infty}\int_{-\infty}^{\infty}\left|p^*(t|X^T) - \left(\frac{\beta}{2\pi}\right)^{1/2}\exp\left(-\frac{1}{2}\beta t^2\right)\right|dt = 0 \text{ a.s. } [P_{\theta_0}] \tag{4.41}$$

whenever the conditions (C1), (C2) and (C3) hold. This is the analogue of the
Bernstein-von Mises theorem for a class of diffusion processes proved in Prakasa
Rao (1981) and it shows the asymptotic convergence in the $L_1$-mean of the posterior
density to the normal distribution.

For proofs of above results, see Prakasa Rao (2003a).

As a Corollary to the Theorem 4.5, we also obtain that the conditional expectation,
under $P_{\theta_0}$, of $[I_T^{-1}(\hat\theta_T - \theta)]^m$ converges to the corresponding $m$-th absolute
moment of the normal distribution with mean zero and variance $\beta^{-1}$.

We define a regular Bayes estimator of $\theta$, corresponding to a prior probability
density $\lambda(\theta)$ and the loss function $L(\theta,\phi)$, based on the observation $X^T$, as an
estimator which minimizes the posterior risk

$$B_T(\phi) = \int_{-\infty}^{\infty} L(\theta,\phi)\,p(\theta|X^T)\,d\theta \tag{4.42}$$

over all the estimators $\phi$ of $\theta$. Here $L(\theta,\phi)$ is a loss function defined on $\Theta \times \Theta$.

Suppose there exists a measurable regular Bayes estimator $\tilde\theta_T$ for the parameter
$\theta$ (cf. Theorem 3.1.3, Prakasa Rao (1987)). Suppose that the loss function
$L(\theta,\phi)$ satisfies the following conditions:

$$L(\theta,\phi) = \ell(|\theta - \phi|) \ge 0 \tag{4.43}$$

and the function $\ell(t)$ is nondecreasing for $t \ge 0$. An example of such a loss function
is $L(\theta,\phi) = |\theta - \phi|$. Suppose there exist nonnegative functions $R(t)$, $J(t)$ and $G(t)$
such that

(D1) $R(T)\,\ell(tI_T) \le G(t)$ for all $T \ge 0$, (4.44)

(D2) $R(T)\,\ell(tI_T) \to J(t)$ as $T \to \infty$ (4.45)

uniformly on bounded intervals of $t$. Further suppose that the function

(D3) $\int_{-\infty}^{\infty} J(t+h)\exp[-\frac{1}{2}\beta t^2]\,dt$ (4.46)

has a strict minimum at $h = 0$, and

(D4) the function $G(t)$ satisfies the conditions similar to (4.33) and (4.35).

We have the following result giving the asymptotic properties of the Bayes risk
of the estimator $\tilde\theta_T$.

Theorem 4.6. Suppose the conditions (C1) to (C3) in the Theorem 4.5 and the
conditions (D1) to (D4) stated above hold. Then

$$I_T^{-1}(\tilde\theta_T - \hat\theta_T) \to 0 \text{ a.s. } [P_{\theta_0}] \text{ as } T \to \infty \tag{4.47}$$

and

$$\lim_{T\to\infty} R(T)\,B_T(\tilde\theta_T) = \lim_{T\to\infty} R(T)\,B_T(\hat\theta_T) = \left(\frac{\beta}{2\pi}\right)^{1/2}\int_{-\infty}^{\infty} J(t)\exp\left[-\frac{1}{2}\beta t^2\right]dt \text{ a.s. } [P_{\theta_0}]. \tag{4.48}$$

This theorem can be proved by arguments similar to those given in the proof of
Theorem 4.1 in Borwanker et al. (1971).

We have observed earlier that

$$I_T^{-1}(\hat\theta_T - \theta_0) \xrightarrow{\mathcal{L}} N(0, \beta^{-1}) \text{ as } T \to \infty. \tag{4.49}$$

As a consequence of the Theorem 4.6, we obtain that

$$\tilde\theta_T \to \theta_0 \text{ a.s. } [P_{\theta_0}] \text{ as } T \to \infty \tag{4.50}$$

and

$$I_T^{-1}(\tilde\theta_T - \theta_0) \xrightarrow{\mathcal{L}} N(0, \beta^{-1}) \text{ as } T \to \infty. \tag{4.51}$$

In other words, the Bayes estimator is asymptotically normal and has asymptotically
the same distribution as the maximum likelihood estimator. The asymptotic
Bayes risk of the estimator is given by the Theorem 4.6.

5. Statistical inference for fractional Ornstein-Uhlenbeck type process 

In a recent paper, Kleptsyna and Le Breton (2002) studied parameter estimation
problems for fractional Ornstein-Uhlenbeck type process. This is a fractional
analogue of the Ornstein-Uhlenbeck process, that is, a continuous time first order
autoregressive process $X = \{X_t, t \ge 0\}$ which is the solution of a one-dimensional
homogeneous linear stochastic differential equation driven by a fractional Brownian
motion (fBm) $W^H = \{W_t^H, t \ge 0\}$ with Hurst parameter $H \in (1/2, 1)$. Such a
process is the unique Gaussian process satisfying the linear integral equation

$$X_t = \theta\int_0^t X_s\,ds + \sigma W_t^H, \quad t \ge 0. \tag{5.1}$$

They investigate the problem of estimation of the parameters $\theta$ and $\sigma^2$ based on
the observation $\{X_s, 0 \le s \le T\}$ and prove that the maximum likelihood estimator
$\hat\theta_T$ is strongly consistent as $T \to \infty$. It is well known that sequential
estimation methods might lead to equally efficient estimators, as compared to the
maximum likelihood estimators, from the process observed possibly over a shorter
expected period of observation time. Novikov (1972) investigated the asymptotic
properties of a sequential maximum likelihood estimator for the drift parameter
in the Ornstein-Uhlenbeck process. Maximum likelihood estimators are not robust.
Kutoyants and Pilibossian (1994) developed a minimum $L_1$-norm estimator for the
drift parameter. We now discuss the asymptotic properties of sequential maximum
likelihood estimators and minimum $L_1$-norm estimators for the drift parameter for
a fractional Ornstein-Uhlenbeck type process.

Maximum likelihood estimation

Let

$$K_H(t,s) = H(2H-1)\,\frac{d}{ds}\int_s^t r^{H-\frac{1}{2}}(r-s)^{H-\frac{1}{2}}\,dr, \quad 0 \le s \le t. \tag{5.2}$$

The sample paths of the process $\{X_t, t \ge 0\}$ are smooth enough so that the process
$Q$ defined by

$$Q(t) = \frac{d}{dw_t^H}\int_0^t \kappa_H(t,s)\,X_s\,ds, \quad t \in [0,T] \tag{5.3}$$

is well-defined where $w^H$ and $\kappa_H(t,s)$ are as defined in (3.8) and (3.6) respectively
and the derivative is understood in the sense of absolute continuity with respect to
the measure generated by $w^H$. Moreover the sample paths of the process $Q$ belong
to $L^2([0,T], dw^H)$ a.s. [P]. Define the process $Z$ as in (4.6).

As an application of the Girsanov type formula given in Theorem 3.3 for the
fractional Brownian motions derived by Kleptsyna et al. (2000), it follows that the
Radon-Nikodym derivative of the measure $P_\theta^T$, generated by the stochastic process
$X$ when $\theta$ is the true parameter, with respect to the measure generated by the
process $X$ when $\theta = 0$, is given by

$$\frac{dP_\theta^T}{dP_0^T} = \exp\left[\theta\int_0^T Q(s)\,dZ_s - \frac{1}{2}\theta^2\int_0^T Q^2(s)\,dw_s^H\right]. \tag{5.4}$$

Furthermore the quadratic variation $\langle Z \rangle_T$ of the process $Z$ on $[0,T]$ is equal
to $\sigma^2 w_T^H$ a.s. and hence the parameter $\sigma^2$ can be estimated by the relation

$$\lim_{n}\sum_i \left[Z_{t_{i+1}^{(n)}} - Z_{t_i^{(n)}}\right]^2 = \sigma^2 w_T^H \text{ a.s.} \tag{5.5}$$

where $(t_i^{(n)})$ is an appropriate partition of $[0,T]$ such that

$$\sup_i |t_{i+1}^{(n)} - t_i^{(n)}| \to 0$$

as $n \to \infty$. Hence we can estimate $\sigma^2$ almost surely from any small interval as
long as we have a continuous observation of the process. For further discussion, we
assume that $\sigma^2 = 1$.
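
Relation (5.5) suggests a simple estimator of $\sigma^2$ from a finely sampled path of $Z$. The sketch below is an illustration under simplifying assumptions that are not in the text: it takes $H = 1/2$, so that $w_T^H = T$, and simulates $Z$ as a constant drift plus $\sigma$ times a Wiener process. The names `estimate_sigma2` and `simulate_z` are hypothetical.

```python
import numpy as np


def estimate_sigma2(z, w_total):
    """Realized quadratic variation of Z over the grid, divided by w_T^H, cf. (5.5)."""
    increments = np.diff(np.asarray(z, dtype=float))
    return float(np.sum(increments ** 2) / w_total)


def simulate_z(sigma, drift, horizon, n_steps, rng):
    """Illustrative path: Z_t = drift * t + sigma * B_t on an equally spaced grid."""
    dt = horizon / n_steps
    dz = drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_steps)
    return np.concatenate(([0.0], np.cumsum(dz)))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    z = simulate_z(sigma=1.7, drift=0.3, horizon=1.0, n_steps=200_000, rng=rng)
    print(estimate_sigma2(z, w_total=1.0), 1.7 ** 2)
```

The drift contributes only a term of order of the mesh size to the sum of squared increments, which is why the estimate does not depend on $\theta$ in the limit.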

We consider the problem of estimation of the parameter $\theta$ based on the observation
of the process $X = \{X_t, 0 \le t \le T\}$ for a fixed time $T$ and study its asymptotic
properties as $T \to \infty$. The following results are due to Kleptsyna and Le Breton
(2002) and Prakasa Rao (2003a).

Theorem 5.1. The maximum likelihood estimator of $\theta$ from the observation $X =
\{X_t, 0 \le t \le T\}$ is given by

$$\hat\theta_T = \left\{\int_0^T Q^2(s)\,dw_s^H\right\}^{-1}\int_0^T Q(s)\,dZ_s. \tag{5.6}$$

Then the estimator $\hat\theta_T$ is strongly consistent as $T \to \infty$, that is,

$$\lim_{T\to\infty}\hat\theta_T = \theta \text{ a.s. } [P_\theta] \tag{5.7}$$

for every $\theta \in R$.

We now discuss the limiting distribution of the MLE $\hat\theta_T$ as $T \to \infty$.

Theorem 5.2. Let

$$R_T = \int_0^T Q(s)\,dZ_s. \tag{5.8}$$

Assume that there exists a norming function $I_t, t \ge 0$ such that

$$I_T^2\int_0^T Q^2(t)\,dw_t^H \xrightarrow{p} \eta^2 \text{ as } T \to \infty \tag{5.9}$$

where $I_T \to 0$ as $T \to \infty$ and $\eta$ is a random variable such that $P(\eta > 0) = 1$. Then

$$(I_T R_T, I_T^2 \langle R_T \rangle) \xrightarrow{\mathcal{L}} (\eta Z, \eta^2) \text{ as } T \to \infty \tag{5.10}$$

where the random variable $Z$ has the standard normal distribution and the random
variables $Z$ and $\eta$ are independent.

Observe that 

Applying the Theorem 5.2, we obtain the following result. 

Theorem 5.3. Suppose the conditions stated in the Theorem 5.2 hold. Then

$$I_T^{-1}(\hat\theta_T - \theta_0) \xrightarrow{\mathcal{L}} \frac{Z}{\eta} \text{ as } T \to \infty \tag{5.12}$$

where the random variable $Z$ has the standard normal distribution and the random
variables $Z$ and $\eta$ are independent.

Remarks. If the random variable $\eta$ is a constant with probability one, then the
limiting distribution of the maximum likelihood estimator is normal with mean 0
and variance $\eta^{-2}$. Otherwise it is a mixture of the normal distributions with mean
zero and variance $\eta^{-2}$ with the mixing distribution as that of $\eta$. A Berry-Esseen
type bound for the MLE is discussed in Prakasa Rao (2003b) when the limiting
distribution of the MLE is normal.


Sequential maximum likelihood estimation 

We now consider the problem of sequential maximum likelihood estimation of
the parameter $\theta$. Let h be a nonnegative number. Define the stopping rule $\tau(h)$ by
the rule

$$\tau(h) = \inf\Big\{t : \int_0^t Q^2(s)\,dw_s^H \ge h\Big\}. \tag{5.13}$$

Kleptsyna and Le Breton (2002) have shown that

$$\lim_{t \to \infty} \int_0^t Q^2(s)\,dw_s^H = +\infty \quad \text{a.s. } [P_\theta] \tag{5.14}$$

for every $\theta \in R$. Then it can be shown that $P_\theta(\tau(h) < \infty) = 1$. If the process
is observed up to a previously determined time T, we know that the maximum
likelihood estimator is given by

$$\hat\theta_T = \Big\{\int_0^T Q^2(s)\,dw_s^H\Big\}^{-1} \int_0^T Q(s)\,dZ_s. \tag{5.15}$$

The estimator

$$\hat\theta(h) \equiv \hat\theta_{\tau(h)} = \Big\{\int_0^{\tau(h)} Q^2(s)\,dw_s^H\Big\}^{-1} \int_0^{\tau(h)} Q(s)\,dZ_s = h^{-1} \int_0^{\tau(h)} Q(s)\,dZ_s \tag{5.16}$$

is called the sequential maximum likelihood estimator of $\theta$. We now study the
asymptotic properties of the estimator $\hat\theta(h)$.


The following lemma is an analogue of the Cramer-Rao inequality for sequential
plans $(\tau(X), \hat\theta_\tau(X))$ for estimating the parameter $\theta$ satisfying the property

$$E_\theta\{\hat\theta_\tau(X)\} = \theta \tag{5.17}$$

for all $\theta$.

Lemma 5.4. Suppose that differentiation under the integral sign with respect to $\theta$
on the left side of the equation (5.17) is permissible. Further suppose that

$$E_\theta\Big\{\int_0^{\tau(X)} Q^2(s)\,dw_s^H\Big\} < \infty \tag{5.18}$$

for all $\theta$. Then

$$Var_\theta\{\hat\theta_\tau(X)\} \ge \Big\{E_\theta\Big[\int_0^{\tau(X)} Q^2(s)\,dw_s^H\Big]\Big\}^{-1} \tag{5.19}$$

for all $\theta$.

A sequential plan $(\tau(X), \hat\theta_\tau(X))$ is said to be efficient if there is equality in
(5.19) for all $\theta$. We now have the following result.

Theorem 5.5. Consider the fractional Ornstein-Uhlenbeck type process governed
by the stochastic differential equation (5.1) with $\sigma = 1$ driven by the fractional
Brownian motion $W^H$ with $H \in [\frac{1}{2}, 1)$. Then the sequential plan $(\tau(h), \hat\theta(h))$ defined
by the equations (5.13) and (5.16) has the following properties for all $\theta$.

(i) $\hat\theta(h) \equiv \hat\theta_{\tau(h)}$ is normally distributed with $E_\theta(\hat\theta(h)) = \theta$ and $Var_\theta(\hat\theta(h)) = h^{-1}$;

(ii) the plan is efficient; and

(iii) the plan is closed, that is, $P_\theta(\tau(h) < \infty) = 1$.

For proof, see Prakasa Rao (2004a).
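
The sequential plan defined by (5.13) and (5.16) can be illustrated in the same simplified setting H = 1/2, again under the assumption (not made in this section) that then Q = X, Z = X and $dw_t^H = dt$. The sketch runs an Euler-discretised path until the observed information $\int X^2\,dt$ first reaches h and then returns $h^{-1} \int X\,dX$; the step size, the seed, the safety cap on the number of steps and the function name are hypothetical choices.

```python
import math
import random

def sequential_mle(theta, x0, h, dt, rng, max_steps=1_000_000):
    """Simulate dX = theta*X dt + dW and stop once int X^2 dt reaches h.

    Returns (tau, estimate, information), where the estimate is the
    discretised analogue of h^{-1} * int_0^tau X dX from (5.16).
    """
    x, info, stoch_int, steps = x0, 0.0, 0.0, 0
    while info < h and steps < max_steps:
        dx = theta * x * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
        stoch_int += x * dx   # increment of int X dX
        info += x * x * dt    # increment of int X^2 dt
        x += dx
        steps += 1
    return steps * dt, stoch_int / h, info

rng = random.Random(1)
h = 50.0
tau, theta_seq, info = sequential_mle(theta=-1.0, x0=1.0, h=h, dt=0.01, rng=rng)
print("stopping time:", tau, "sequential estimate:", theta_seq)
```

Because the rule stops at a prescribed level of information rather than at a prescribed time, the spread of the estimate is governed by h alone, which is the content of property (i) above (variance $h^{-1}$).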


Minimum L1-norm estimation

In spite of the fact that maximum likelihood estimators (MLE) are consistent
and asymptotically normal and also asymptotically efficient in general, they have
some shortcomings at the same time. Their calculation is often cumbersome as the
expressions for the MLE involve stochastic integrals which need good approximations
for computational purposes. Furthermore the MLE are not robust in the sense
that a slight perturbation in the noise component will change the properties of the
MLE substantially. In order to circumvent such problems, the minimum distance
approach is proposed. Properties of the minimum distance estimators (MDE) were
discussed in Millar (1984) in a general framework.

We now obtain the minimum $L_1$-norm estimates of the drift parameter of a
fractional Ornstein-Uhlenbeck type process and investigate the asymptotic properties
of such estimators following the work of Kutoyants and Pilibossian (1994).

We now consider the problem of estimation of the parameter $\theta$ based on the
observation of a fractional Ornstein-Uhlenbeck type process $X = \{X_t, 0 \le t \le T\}$
satisfying the stochastic differential equation

$$dX_t = \theta X(t)\,dt + \varepsilon\,dW_t^H, \quad X_0 = x_0, \quad 0 \le t \le T \tag{5.20}$$

for a fixed time T where $\theta \in \Theta \subset R$ and study its asymptotic properties as $\varepsilon \to 0$.
Let $x_t(\theta)$ be the solution of the above differential equation with $\varepsilon = 0$. It is
obvious that

$$x_t(\theta) = x_0 e^{\theta t}, \quad 0 \le t \le T. \tag{5.21}$$

Let

$$S_T(\theta) = \int_0^T |X_t - x_t(\theta)|\,dt. \tag{5.22}$$

We define $\theta_\varepsilon^*$ to be a minimum $L_1$-norm estimator if there exists a measurable
selection $\theta_\varepsilon^*$ such that

$$S_T(\theta_\varepsilon^*) = \inf_{\theta \in \Theta} S_T(\theta). \tag{5.23}$$

Conditions for the existence of a measurable selection are given in Lemma 3.1.2
in Prakasa Rao (1987). We assume that there exists a measurable selection $\theta_\varepsilon^*$
satisfying the above equation. An alternate way of defining the estimator $\theta_\varepsilon^*$ is by
the relation

$$\theta_\varepsilon^* = \arg\inf_{\theta \in \Theta} \int_0^T |X_t - x_t(\theta)|\,dt. \tag{5.24}$$
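
A minimal numerical sketch of the minimum L1-norm estimator follows. It replaces the integral in (5.22) by a Riemann sum on a uniform time grid and the infimum in (5.24) by a search over a finite grid of candidate values; both discretisations, the small deterministic perturbation standing in for the noise term, and the function names are assumptions made only for illustration.

```python
import math

def l1_distance(xs, ts, theta, x0):
    """Riemann-sum approximation of S_T(theta) = int_0^T |X_t - x0*exp(theta*t)| dt."""
    dt = ts[1] - ts[0]
    return sum(abs(x - x0 * math.exp(theta * t)) for x, t in zip(xs, ts)) * dt

def min_l1_estimate(xs, ts, x0, grid):
    """Grid search for a minimiser of the approximate S_T over candidate drifts."""
    return min(grid, key=lambda th: l1_distance(xs, ts, th, x0))

theta0, x0, T, n = 0.5, 1.0, 1.0, 200
ts = [T * i / n for i in range(n + 1)]
grid = [i / 100 for i in range(-100, 101)]  # candidate drifts in [-1, 1]

# Noise-free path (epsilon = 0): the observation is x_t(theta0) itself.
clean = [x0 * math.exp(theta0 * t) for t in ts]
theta_clean = min_l1_estimate(clean, ts, x0, grid)

# Small bounded perturbation of size eps, a deterministic stand-in for eps * W_t^H.
eps = 0.01
noisy = [x + eps * math.sin(20.0 * t) for x, t in zip(clean, ts)]
theta_noisy = min_l1_estimate(noisy, ts, x0, grid)
print(theta_clean, theta_noisy)
```

In the noise-free case the search returns the true drift exactly, and a perturbation of sup-norm at most eps can move the minimiser only slightly, which is the mechanism behind the consistency result that follows.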


Consistency:

Let $W_T^{H*} = \sup_{0 \le t \le T} |W_t^H|$. The self-similarity of the fractional Brownian
motion $W_t^H$ implies that the random variables $W_{at}^H$ and $a^H W_t^H$ have the same
probability distribution for any a > 0. Furthermore it follows from the self-similarity
that the supremum process $W^{H*}$ has the property that the random variables $W_{at}^{H*}$
and $a^H W_t^{H*}$ have the same probability distribution for any a > 0. Hence we have
the following observation due to Novikov and Valkeila (1999).

Lemma 5.6. Let T > 0 and the process $\{W_t^H, 0 \le t \le T\}$ be a fBm with Hurst
index H. Let $W_T^{H*} = \sup_{0 \le t \le T} W_t^H$. Then

$$E(W_T^{H*})^p = K(p, H)\,T^{pH} \tag{5.25}$$

for every p > 0, where $K(p, H) = E(W_1^{H*})^p$.

Let $\theta_0$ denote the true parameter. For any $\delta > 0$, define

$$g(\delta) = \inf_{|\theta - \theta_0| > \delta} \int_0^T |x_t(\theta) - x_t(\theta_0)|\,dt. \tag{5.26}$$

Note that $g(\delta) > 0$ for any $\delta > 0$.

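Theorem 5.7. For every p > 0, there exists a positive constant K(p, H) such that,
for every $\delta > 0$,

$$P_{\theta_0}^{(\varepsilon)}\{|\theta_\varepsilon^* - \theta_0| > \delta\} \le 2^p T^{pH+p} K(p, H)\, e^{|\theta_0| T p} (g(\delta))^{-p} \varepsilon^p = O((g(\delta))^{-p} \varepsilon^p). \tag{5.27}$$

Proof. Let $\|\cdot\|$ denote the $L_1$-norm. Then

$$\begin{aligned}
P_{\theta_0}^{(\varepsilon)}\{|\theta_\varepsilon^* - \theta_0| > \delta\}
&= P_{\theta_0}^{(\varepsilon)}\Big\{\inf_{|\theta - \theta_0| \le \delta} \|X - x(\theta)\| > \inf_{|\theta - \theta_0| > \delta} \|X - x(\theta)\|\Big\} \\
&\le P_{\theta_0}^{(\varepsilon)}\Big\{\inf_{|\theta - \theta_0| \le \delta} (\|X - x(\theta_0)\| + \|x(\theta) - x(\theta_0)\|) \\
&\qquad\qquad > \inf_{|\theta - \theta_0| > \delta} (\|x(\theta) - x(\theta_0)\| - \|X - x(\theta_0)\|)\Big\} \\
&= P_{\theta_0}^{(\varepsilon)}\Big\{2\|X - x(\theta_0)\| > \inf_{|\theta - \theta_0| > \delta} \|x(\theta) - x(\theta_0)\|\Big\} \\
&= P_{\theta_0}^{(\varepsilon)}\Big\{\|X - x(\theta_0)\| > \tfrac{1}{2} g(\delta)\Big\}.
\end{aligned} \tag{5.28}$$

Since the process $X_t$ satisfies the stochastic differential equation (5.20), it follows
that

$$X_t - x_t(\theta_0) = x_0 + \theta_0 \int_0^t X_s\,ds + \varepsilon W_t^H - x_t(\theta_0) = \theta_0 \int_0^t (X_s - x_s(\theta_0))\,ds + \varepsilon W_t^H \tag{5.29}$$

since $x_t(\theta) = x_0 e^{\theta t}$. Let $U_t = X_t - x_t(\theta_0)$. Then it follows from the above equation
that

$$U_t = \theta_0 \int_0^t U_s\,ds + \varepsilon W_t^H. \tag{5.30}$$

Let $V_t = |U_t| = |X_t - x_t(\theta_0)|$. The above relation implies that

$$V_t = |X_t - x_t(\theta_0)| \le |\theta_0| \int_0^t V_s\,ds + \varepsilon |W_t^H|. \tag{5.31}$$

Applying the Gronwall-Bellman Lemma, we obtain that

$$\sup_{0 \le t \le T} |V_t| \le \varepsilon e^{|\theta_0| T} \sup_{0 \le t \le T} |W_t^H|. \tag{5.32}$$

Hence

$$P_{\theta_0}^{(\varepsilon)}\Big\{\|X - x(\theta_0)\| > \tfrac{1}{2} g(\delta)\Big\} \le P\Big\{\sup_{0 \le t \le T} |W_t^H| > \frac{e^{-|\theta_0| T} g(\delta)}{2 \varepsilon T}\Big\} = P\Big\{W_T^{H*} > \frac{e^{-|\theta_0| T} g(\delta)}{2 \varepsilon T}\Big\}. \tag{5.33}$$

Applying Lemma 5.6 to the estimate obtained above, we get that

$$P_{\theta_0}^{(\varepsilon)}\{|\theta_\varepsilon^* - \theta_0| > \delta\} \le 2^p T^{pH+p} K(p, H)\, e^{|\theta_0| T p} (g(\delta))^{-p} \varepsilon^p = O((g(\delta))^{-p} \varepsilon^p). \tag{5.34}$$

□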


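Remarks. As a consequence of the above theorem, we obtain that $\theta_\varepsilon^*$ converges in
probability to $\theta_0$ under $P_{\theta_0}^{(\varepsilon)}$-measure as $\varepsilon \to 0$. Furthermore the rate of convergence
is of the order $O(\varepsilon^p)$ for every p > 0.

Asymptotic distribution

We will now study the asymptotic distribution, if any, of the estimator $\theta_\varepsilon^*$ after
suitable scaling. It can be checked that

$$X_t = e^{\theta_0 t}\Big\{x_0 + \varepsilon \int_0^t e^{-\theta_0 s}\,dW_s^H\Big\} \tag{5.35}$$

or equivalently

$$X_t - x_t(\theta_0) = \varepsilon e^{\theta_0 t} \int_0^t e^{-\theta_0 s}\,dW_s^H. \tag{5.36}$$

Let

$$Y_t = e^{\theta_0 t} \int_0^t e^{-\theta_0 s}\,dW_s^H. \tag{5.37}$$

Note that $\{Y_t, 0 \le t \le T\}$ is a gaussian process and can be interpreted as the
"derivative" of the process $\{X_t, 0 \le t \le T\}$ with respect to $\varepsilon$. Applying Theorem
3.1, we obtain that, P-a.s.,

$$Y_t e^{-\theta_0 t} = \int_0^t e^{-\theta_0 s}\,dW_s^H = \int_0^t K_H^f(t, s)\,dM_s^H, \quad t \in [0, T] \tag{5.38}$$

where $f(s) = e^{-\theta_0 s}$, $s \in [0, T]$ and $M^H$ is the fundamental gaussian martingale
associated with the fBm $W^H$. In particular it follows that the random variable
$Y_t e^{-\theta_0 t}$ and hence $Y_t$ has normal distribution with mean zero and furthermore, for
any $h \ge 0$,

$$\begin{aligned}
Cov(Y_t, Y_{t+h}) &= e^{2\theta_0 t + \theta_0 h}\, E\Big[\int_0^t e^{-\theta_0 u}\,dW_u^H \int_0^{t+h} e^{-\theta_0 v}\,dW_v^H\Big] \\
&= e^{2\theta_0 t + \theta_0 h} H(2H - 1) \int_0^t \int_0^t e^{-\theta_0 (u+v)} |u - v|^{2H-2}\,du\,dv \\
&= e^{2\theta_0 t + \theta_0 h} \gamma_H(t) \quad \text{(say)}.
\end{aligned} \tag{5.39}$$

In particular

$$Var(Y_t) = e^{2\theta_0 t} \gamma_H(t). \tag{5.40}$$

Hence $\{Y_t, 0 \le t \le T\}$ is a zero mean gaussian process with
$Cov(Y_t, Y_s) = e^{\theta_0 (t+s)} \gamma_H(t)$ for $s \ge t$.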

Let

$$\zeta = \arg\inf_{-\infty < u < \infty} \int_0^T |Y_t - u t x_0 e^{\theta_0 t}|\,dt. \tag{5.41}$$

Theorem 5.8. As $\varepsilon \to 0$, the random variable $\varepsilon^{-1}(\theta_\varepsilon^* - \theta_0)$ converges in
probability under the probability measure $P_{\theta_0}$ to a random variable whose probability
distribution is the same as that of the random variable $\zeta$ under $P_{\theta_0}$.

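Proof. Let $x_t'(\theta) = x_0 t e^{\theta t}$ and let

$$Z_\varepsilon(u) = \|Y - \varepsilon^{-1}(x(\theta_0 + \varepsilon u) - x(\theta_0))\| \tag{5.42}$$

and

$$Z_0(u) = \|Y - u x'(\theta_0)\|. \tag{5.43}$$

Furthermore, let

$$A_\varepsilon = \{\omega : |\theta_\varepsilon^* - \theta_0| < \delta_\varepsilon\}, \quad \delta_\varepsilon = \varepsilon^\tau, \quad \tau \in (\tfrac{1}{2}, 1), \quad L_\varepsilon = \varepsilon^{\tau - 1}. \tag{5.44}$$

Observe that the random variable $u_\varepsilon^* = \varepsilon^{-1}(\theta_\varepsilon^* - \theta_0)$ satisfies the equation

$$Z_\varepsilon(u_\varepsilon^*) = \inf_{|u| < L_\varepsilon} Z_\varepsilon(u), \quad \omega \in A_\varepsilon. \tag{5.45}$$

Define

$$\zeta_\varepsilon = \arg\inf_{|u| < L_\varepsilon} Z_0(u). \tag{5.46}$$

Observe that, with probability one,

$$\begin{aligned}
\sup_{|u| < L_\varepsilon} |Z_\varepsilon(u) - Z_0(u)|
&= \sup_{|u| < L_\varepsilon} \Big|\, \|Y - u x'(\theta_0) - \tfrac{1}{2} \varepsilon u^2 x''(\tilde\theta)\| - \|Y - u x'(\theta_0)\| \,\Big| \\
&\le \frac{\varepsilon}{2} L_\varepsilon^2 \sup_{|\theta - \theta_0| < \delta_\varepsilon} \int_0^T |x''(\theta)|\,dt \\
&\le C \varepsilon^{2\tau - 1}.
\end{aligned} \tag{5.47}$$

Here $\tilde\theta = \theta_0 + \alpha(\theta - \theta_0)$ for some $\alpha \in (0, 1)$. Note that the last term in the above
inequality tends to zero as $\varepsilon \to 0$. Furthermore the process $\{Z_0(u), -\infty < u < \infty\}$
has a unique minimum $u^*$ with probability one. This follows from the arguments
given in Theorem 2 of Kutoyants and Pilibossian (1994). In addition, we can choose
the interval [-L, L] such that

$$P_{\theta_0}^{(\varepsilon)}\{u_\varepsilon^* \in (-L, L)\} \ge 1 - \beta g(L)^{-p} \tag{5.48}$$

and

$$P\{u^* \in (-L, L)\} \ge 1 - \beta g(L)^{-p} \tag{5.49}$$

where $\beta > 0$. Note that g(L) increases as L increases. The processes $Z_\varepsilon(u)$, $u \in [-L, L]$,
and $Z_0(u)$, $u \in [-L, L]$, satisfy the Lipschitz conditions and $Z_\varepsilon(u)$ converges
uniformly to $Z_0(u)$ over $u \in [-L, L]$. Hence the minimizer of $Z_\varepsilon(\cdot)$ converges to the
minimizer of $Z_0(u)$. This completes the proof. □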

Remarks. We have seen earlier that the process $\{Y_t, 0 \le t \le T\}$ is a zero mean
gaussian process with the covariance function

$$Cov(Y_t, Y_s) = e^{\theta_0 (t+s)} \gamma_H(t)$$

for $s \ge t$. Recall that

$$\zeta = \arg\inf_{-\infty < u < \infty} \int_0^T |Y_t - u t x_0 e^{\theta_0 t}|\,dt. \tag{5.50}$$

It is not clear what the distribution of $\zeta$ is. Observe that for every u, the integrand
in the above integral is the absolute value of a gaussian process $\{J_t, 0 \le t \le T\}$
with the mean function $E(J_t) = -u t x_0 e^{\theta_0 t}$ and the covariance function

$$Cov(J_t, J_s) = e^{\theta_0 (t+s)} \gamma_H(t)$$

for $s \ge t$.

6. Identification for linear stochastic systems driven by fBm

We now discuss the problem of nonparametric estimation or identification of the
"drift" function $\theta(t)$ for a class of stochastic processes satisfying a stochastic
differential equation

$$dX_t = \theta(t) X_t\,dt + dW_t^H, \quad X_0 = \tau, \quad t \ge 0 \tag{6.1}$$

where $\tau$ is a gaussian random variable independent of the process $\{W_t^H\}$ which
is a fBm with known Hurst parameter. We use the method of sieves and study
the asymptotic properties of the estimator. Identification of nonstationary diffusion
models by the method of sieves is studied in Nguyen and Pham (1982).

Estimation by the method of sieves 

We assume that $\theta(t) \in L^2([0, T], dt)$. In other words $X = \{X_t, t \ge 0\}$ is a
stochastic process satisfying the stochastic integral equation

$$X(t) = \tau + \int_0^t \theta(s) X(s)\,ds + W_t^H, \quad 0 \le t \le T \tag{6.2}$$

where $\theta(t) \in L^2([0, T], dt)$. Let

$$C_\theta(t) = \theta(t) X(t), \quad 0 \le t \le T \tag{6.3}$$

and assume that the sample paths of the process $\{C_\theta(t), 0 \le t \le T\}$ are smooth
enough so that the process

$$Q_{H,\theta}(t) = \frac{d}{dw_t^H} \int_0^t k_H(t, s)\, C_\theta(s)\,ds, \quad 0 \le t \le T \tag{6.4}$$

is well-defined where $w_t^H$ and $k_H(t, s)$ are as defined in (3.8) and (3.6) respectively.
Suppose the sample paths of the process $\{Q_{H,\theta}(t), 0 \le t \le T\}$ belong almost surely
to $L^2([0, T], dw_t^H)$. Define

$$Z_t = \int_0^t k_H(t, s)\,dX_s, \quad 0 \le t \le T. \tag{6.5}$$

Then the process $Z = \{Z_t, 0 \le t \le T\}$ is an $(\mathcal{F}_t)$-semimartingale with the
decomposition

$$Z_t = \int_0^t Q_{H,\theta}(s)\,dw_s^H + M_t^H \tag{6.6}$$

where $M^H$ is the fundamental martingale defined by (3.9) and the process X admits
the representation

$$X_t = X_0 + \int_0^t K_H(t, s)\,dZ_s \tag{6.7}$$

where the function $K_H$ is as defined by (3.11) with $f \equiv 1$. Let $P_\theta^T$ be the measure
induced by the process $\{X_t, 0 \le t \le T\}$ when $\theta(\cdot)$ is the true "drift" function.
Following Theorem 3.3, we get that the Radon-Nikodym derivative of $P_\theta^T$ with
respect to $P_0^T$ is given by

$$\frac{dP_\theta^T}{dP_0^T} = \exp\Big[\int_0^T Q_{H,\theta}(s)\,dZ_s - \frac{1}{2} \int_0^T Q_{H,\theta}^2(s)\,dw_s^H\Big]. \tag{6.8}$$

Suppose the process X is observable on [0, T] and $X_i$, $1 \le i \le n$, is a random
sample of n independent observations of the process X on [0, T]. Following the
representation of the Radon-Nikodym derivative of $P_\theta^T$ with respect to $P_0^T$ given
above, it follows that the log-likelihood function corresponding to the observations
$\{X_i, 1 \le i \le n\}$ is given by

$$L_n(X_1, \ldots, X_n; \theta) \equiv L_n(\theta) = \sum_{i=1}^n \Big(\int_0^T Q_{H,\theta}^{(i)}(s)\,dZ_i(s) - \frac{1}{2} \int_0^T [Q_{H,\theta}^{(i)}]^2(s)\,dw_s^H\Big) \tag{6.9}$$

where the process $Q_{H,\theta}^{(i)}$ is as defined by the relation (6.4) for the process $X_i$. For
convenience in notation, we write $Q_{i,\theta}(s)$ hereafter for $Q_{H,\theta}^{(i)}(s)$.

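Let $\{V_n, n \ge 1\}$ be an increasing sequence of subspaces of finite dimensions
$\{d_n\}$ such that $\cup_{n \ge 1} V_n$ is dense in $L^2([0, T], dt)$. The method of sieves consists in
maximizing $L_n(\theta)$ on the subspace $V_n$. Let $\{e_i\}$ be a set of linearly independent
vectors in $L^2([0, T], dt)$ such that the set of vectors $\{e_1, \ldots, e_{d_n}\}$ is a basis for the
subspace $V_n$ for every $n \ge 1$. For $\theta \in V_n$, $\theta(\cdot) = \sum_{j=1}^{d_n} \theta_j e_j(\cdot)$, we have

$$\begin{aligned}
Q_{i,\theta}(t) &= \frac{d}{dw_t^H} \int_0^t k_H(t, s)\,\theta(s) X_i(s)\,ds \\
&= \frac{d}{dw_t^H} \int_0^t k_H(t, s) \Big[\sum_{j=1}^{d_n} \theta_j e_j(s)\Big] X_i(s)\,ds \\
&= \sum_{j=1}^{d_n} \theta_j \frac{d}{dw_t^H} \int_0^t k_H(t, s)\, e_j(s) X_i(s)\,ds \\
&= \sum_{j=1}^{d_n} \theta_j \Gamma_{i,j}(t) \quad \text{(say)}.
\end{aligned} \tag{6.10}$$

Furthermore

$$\begin{aligned}
\int_0^T Q_{i,\theta}(t)\,dZ_i(t) &= \int_0^T \Big[\sum_{j=1}^{d_n} \theta_j \Gamma_{i,j}(t)\Big]\,dZ_i(t) \\
&= \sum_{j=1}^{d_n} \theta_j \int_0^T \Gamma_{i,j}(t)\,dZ_i(t) \\
&= \sum_{j=1}^{d_n} \theta_j R_{i,j} \quad \text{(say)}
\end{aligned} \tag{6.11}$$

and

$$\begin{aligned}
\int_0^T Q_{i,\theta}^2(t)\,dw_t^H &= \int_0^T \Big[\sum_{j=1}^{d_n} \theta_j \Gamma_{i,j}(t)\Big]^2\,dw_t^H \\
&= \sum_{j=1}^{d_n} \sum_{k=1}^{d_n} \theta_j \theta_k \langle R_{i,j}, R_{i,k} \rangle
\end{aligned} \tag{6.12}$$

where $\langle \cdot, \cdot \rangle$ denotes the quadratic covariation. Therefore the log-likelihood
function corresponding to the observations $\{X_i, 1 \le i \le n\}$ is given by

$$\begin{aligned}
L_n(\theta) &= \sum_{i=1}^n \Big(\int_0^T Q_{i,\theta}(t)\,dZ_i(t) - \frac{1}{2} \int_0^T Q_{i,\theta}^2(t)\,dw_t^H\Big) \\
&= \sum_{i=1}^n \Big[\sum_{j=1}^{d_n} \theta_j R_{i,j} - \frac{1}{2} \sum_{j=1}^{d_n} \sum_{k=1}^{d_n} \theta_j \theta_k \langle R_{i,j}, R_{i,k} \rangle\Big] \\
&= n \Big[\sum_{j=1}^{d_n} \theta_j B_j^{(n)} - \frac{1}{2} \sum_{j=1}^{d_n} \sum_{k=1}^{d_n} \theta_j \theta_k A_{j,k}^{(n)}\Big]
\end{aligned} \tag{6.13}$$

where

$$B_j^{(n)} = n^{-1} \sum_{i=1}^n R_{i,j}, \quad 1 \le j \le d_n \tag{6.14}$$

and

$$A_{j,k}^{(n)} = n^{-1} \sum_{i=1}^n \langle R_{i,j}, R_{i,k} \rangle, \quad 1 \le j, k \le d_n. \tag{6.15}$$

Let $\theta^{(n)}$, $B^{(n)}$ and $A^{(n)}$ be the vectors and the matrix with elements $\theta_j$, $j = 1, \ldots, d_n$,
$B_j^{(n)}$, $j = 1, \ldots, d_n$, and $A_{j,k}^{(n)}$, $j, k = 1, \ldots, d_n$, as defined above. Then the
log-likelihood function can be written in the form

$$L_n(\theta) = n \Big[B^{(n)\prime} \theta^{(n)} - \frac{1}{2} \theta^{(n)\prime} A^{(n)} \theta^{(n)}\Big]. \tag{6.16}$$

Here $\alpha'$ denotes the transpose of the vector $\alpha$. The restricted maximum likelihood
estimator $\hat\theta^{(n)}(\cdot)$ of $\theta(\cdot)$ is given by

$$\hat\theta^{(n)}(\cdot) = \sum_{j=1}^{d_n} \hat\theta_j^{(n)} e_j(\cdot) \tag{6.17}$$

where

$$\hat\theta^{(n)} = (\hat\theta_1^{(n)}, \ldots, \hat\theta_{d_n}^{(n)})' \tag{6.18}$$

is the solution of the equation

$$A^{(n)} \hat\theta^{(n)} = B^{(n)}. \tag{6.19}$$

Assuming that $A^{(n)}$ is invertible, we get that

$$\hat\theta^{(n)} = (A^{(n)})^{-1} B^{(n)}. \tag{6.20}$$

Asymptotic properties of the estimator $\hat\theta^{(n)}(\cdot)$ are studied in Prakasa Rao (2004b).
We do not go into the details here.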
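
Computationally, the last step of the sieve procedure is an ordinary linear solve: the coefficient vector maximising the quadratic form (6.16) is the solution of the normal equations (6.19). The sketch below illustrates only that step. The 2 x 2 matrix and vector are made-up inputs standing in for the quantities defined in (6.14) and (6.15), and the small Gaussian-elimination routine and function names are illustrative, not part of the cited work.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A square, invertible)."""
    n = len(b)
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def log_likelihood(theta, A, B, n_obs):
    """Quadratic form (6.16): n * (B'theta - 0.5 * theta' A theta)."""
    d = len(theta)
    lin = sum(B[j] * theta[j] for j in range(d))
    quad = sum(theta[j] * A[j][k] * theta[k] for j in range(d) for k in range(d))
    return n_obs * (lin - 0.5 * quad)

# Hypothetical inputs; in practice A and B come from (6.14)-(6.15).
A = [[2.0, 0.5], [0.5, 1.0]]
B = [1.0, 0.25]
theta_hat = solve_linear(A, B)
print(theta_hat, log_likelihood(theta_hat, A, B, n_obs=10))
```

When the matrix is symmetric and positive definite, as in this example, the solution of (6.19) is the unique maximiser of (6.16), so perturbing any coordinate of the returned vector can only lower the log-likelihood.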

7. Remarks 

(1) We have considered the stochastic differential equations of the type

$$dY_t = C(t)\,dt + B(t)\,dW_t^H, \quad t \ge 0 \tag{7.1}$$

driven by a fBm where $B(\cdot)$ is a nonrandom function. As was mentioned earlier,
one can define a stochastic integral of a nonrandom function with respect to a
fractional Brownian motion via a suitable limit of Riemann-Stieltjes type
approximating sums as was described in Gripenberg and Norros (1996). However it is not
possible to extend this approach to define stochastic integration of a large class
of random functions with respect to a fractional Brownian motion in view of the
fact that the fractional Brownian motion is not a semimartingale. It is known that if a
stochastic process $\{Z_t, t \ge 0\}$ has the property that the stochastic integral $\int B_t\,dZ_t$
is well-defined for a large class of integrands $\{B_t, t \ge 0\}$ and satisfies reasonable
conditions such as linearity and dominated convergence theorems as satisfied by
integrals with respect to $\sigma$-finite measures, then the process $\{Z_t, t \ge 0\}$ has to be a
semimartingale (cf. Metivier and Pellaumail (1980)). Hence the classical theory of
stochastic integration with respect to a Brownian motion cannot be extended to
define stochastic integration with respect to a fBm for random integrands in the
usual manner. Lin (1995) and Dai and Heyde (1996) defined stochastic integrals
with respect to fBm and extended the Ito formula. Their definition of a stochastic
integral leads to a stochastic integral of Stratonovich type and the corresponding
Ito formula is the standard chain rule for differentiation. The stochastic integral
$\int B_t\,dZ_t$ defined by them however does not satisfy the property $E(\int B_t\,dZ_t) = 0$
in general, which is essential for modeling purposes. Duncan et al (2000) defined
stochastic integration of a random function $\{B_t, t \ge 0\}$ with respect to a fBm
$\{W_t^H, t \ge 0\}$, $H \in (\frac{1}{2}, 1)$, using the concept of Wick product and this integral
satisfies the condition $E(\int B_t\,dW_t^H) = 0$ whenever it is well-defined. They have also
developed the corresponding Ito type formula in their work. Using the notion of
Skorokhod integral, Decreusefond and Ustunel (1999) developed a stochastic integral
with respect to a fBm (cf. Decreusefond (2003)).

(2) We have assumed throughout Section 4 to Section 6 that a complete
path of the process $\{X_t, 0 \le t \le T\}$ is observable and that the process is driven by
a fBm with known Hurst index H. The problem of estimation of the index H has
been studied well and a discussion is given in Section 2. The problem of estimation of
the parameters in the absence of knowledge of the Hurst index H remains open. It
would be interesting to find whether it is possible to estimate the parameters and
the index H simultaneously from a complete path of the process $\{X_t, 0 \le t \le T\}$.
From a practical point of view, it is clear that the assumption that a complete path
of the process $\{X_t, 0 \le t \le T\}$ is observable is not tenable. Suppose the process
$\{X_t, 0 \le t \le T\}$ is observed at some discrete set of times $\{t_i, 1 \le i \le n\}$ in the
interval [0, T] where the time points $\{t_i, 1 \le i \le n\}$ could be nonrandom or random
as well as equally spaced or irregularly spaced. If the process is observed at a set of
discrete times, then the problems of estimation of the parameters involved as well
as the estimation of the Hurst index in case it is unknown remain open. It would be
interesting to study these problems for the models discussed in this paper. A general
discussion on statistical inference from sampled data for stochastic processes is given
in Prakasa Rao (1988). Results for the special case of diffusion type processes are
studied in Prakasa Rao (1999a).


Abstract: This paper shows how the invariance of the arc-sine distribution
on (0, 1) under a family of rational maps is related on the one hand to various
integral identities with probabilistic interpretations involving random variables
derived from Brownian motion with arc-sine, Gaussian, Cauchy and other
distributions, and on the other hand to results in the analytic theory of iterated
rational maps.


1. Introduction 

Lévy [20, 21] showed that a random variable A with the arc-sine law

$$P(A \in da) = \frac{da}{\pi \sqrt{a(1 - a)}} \quad (0 < a < 1) \tag{1}$$

can be constructed in numerous ways as a function of the path of a one-dimensional
Brownian motion, or more simply as

$$A \stackrel{d}{=} \tfrac{1}{2}(1 - \cos\Theta) \stackrel{d}{=} \tfrac{1}{2}(1 - \cos 2\Theta) \stackrel{d}{=} \cos^2\Theta \tag{2}$$

where $\Theta$ has uniform distribution on $[0, 2\pi]$ and $\stackrel{d}{=}$ denotes equality in distribution.
See [31, 7] and papers cited there for various extensions of Lévy's results. In
connection with the distribution of local times of a Brownian bridge [29], an integral
identity arises which can be expressed simply in terms of an arc-sine variable A.
Section 5 of this note shows that this identity amounts to the following property
of A, discovered in a very different context by Cambanis, Keener and Simons [6,
Proposition 2.1]: for all real a and c

$$\frac{a^2}{A} + \frac{c^2}{1 - A} \stackrel{d}{=} \frac{(|a| + |c|)^2}{A}. \tag{3}$$

As shown in [6], where (3) is applied to the study of an interesting class of multivariate
distributions, the identity (3) can be checked by a computation with densities,
using (2) and trigonometric identities. Here we offer some derivations of (3) related
to various other characterizations and properties of the arc-sine law. For $u \in [0, 1]$
define a rational function

$$Q_u(a) := \Big(\frac{u^2}{a} + \frac{(1 - u)^2}{1 - a}\Big)^{-1} = \frac{a(1 - a)}{u^2 + (1 - 2u)a}. \tag{4}$$

Figure 1: Graphs of $Q_u(a)$ for $0 \le a \le 1$ and u = k/10 with k = 0, 1, ..., 10.

So (3) amounts to $Q_u(A) \stackrel{d}{=} A$, as restated in the following theorem. It is easily
checked that $Q_u$ increases from 0 to 1 over (0, u) and decreases from 1 to 0 over
(u, 1), as shown in the above graphs.

Theorem 1. For each $u \in (0, 1)$ the arc-sine law is the unique absolutely continuous
probability measure on the line that is invariant under the rational map $a \mapsto Q_u(a)$.

The conclusion of Theorem 1 for $Q_{1/2}(a) = 4a(1 - a)$ is a well known result
in the theory of iterated maps, dating back to Ulam and von Neumann [38]. As
indicated in [3] and [22, Example 1.3], this case follows immediately from (2) and
the ergodicity of the Bernoulli shift $\theta \mapsto 2\theta \pmod{2\pi}$. This argument shows also,
as conjectured in [15, p. 464 (A3)] and contrary to a footnote of [37, p. 233], that
the arc-sine law is not the only non-atomic law of A such that $4A(1 - A) \stackrel{d}{=} A$.
For the argument gives $4A(1 - A) \stackrel{d}{=} A$ if $A = (1 - \cos 2\pi U)/2$ for any distribution
of U on [0, 1) with $(2U \bmod 1) \stackrel{d}{=} U$, and it is well known that such U exist
with singular continuous distributions, for instance $U = \sum_{m=1}^{\infty} X_m 2^{-m}$ for $X_m$
independent Bernoulli(p) for any $p \in (0, 1)$ except p = 1/2. See also [15] and papers
cited there for some related characterizations of the arc-sine law, and [13] where this
property of the arc-sine law is related to duplication formulae for various special
functions defined by Euler integrals.

Section 2 gives a proof of Theorem 1 based on a known characterization of
the standard Cauchy distribution. In terms of a complex Brownian motion Z, the
connection between the two results is that the Cauchy distribution is the hitting
distribution on R for $Z_0 = \pm i$, while the arc-sine law is the hitting distribution
on [0, 1] for $Z_0 = \infty$. The transfer between the two results may be regarded as a
consequence of Lévy's theorem on the conformal invariance of the Brownian track.
In Section 4 we use a closely related approach to generalize Theorem 1 to a large
class of functions Q instead of $Q_u$. The result of this section for rational Q can
also be deduced from the general result of Lalley [18] regarding Q-invariance of the
equilibrium distribution on the Julia set of Q, which Lalley obtained by a similar
application of Lévy's theorem.

2. Proof of Theorem 1 

Let A have the arc-sine law (1), and let C be a standard Cauchy variable, that is

$$P(C \in dy) = \frac{dy}{\pi(1 + y^2)} \quad (y \in R). \tag{5}$$

We will exploit the following elementary fact [33, p. 13]:

$$A \stackrel{d}{=} \frac{1}{1 + C^2}. \tag{6}$$

Using (6) and $C \stackrel{d}{=} -C$, the identity (3) is easily seen to be equivalent to

$$uC - (1 - u)/C \stackrel{d}{=} C. \tag{7}$$

This is an instance of the result of E. J. G. Pitman and E. J. Williams [28] that for
a large class of meromorphic functions G mapping the half plane $\mathbb{H}^+ := \{z \in C :
\mathrm{Im}\, z > 0\}$ to itself, with boundary values mapping R (except for some poles) to R,
there is the identity in distribution

$$G(C) \stackrel{d}{=} \mathrm{Re}\, G(i) + (\mathrm{Im}\, G(i))\, C \tag{8}$$

where $i = \sqrt{-1}$ and $z = \mathrm{Re}\, z + i\, \mathrm{Im}\, z$. Kemperman [14] attributes to Kesten the
remark that (8) follows from Lévy's theorem on the conformal invariance of complex
Brownian motion Z, and the well known fact that for $\tau$ the hitting time of the
real axis by Z, the distribution of $Z_\tau$ given $Z_0 = z$ is that of $\mathrm{Re}\, z + (\mathrm{Im}\, z) C$.
As shown by Letac [19], this argument yields (8) for all inner functions on $\mathbb{H}^+$,
that is all holomorphic functions G from $\mathbb{H}^+$ to $\mathbb{H}^+$ such that the boundary limit
$G(x) := \lim_{y \downarrow 0} G(x + iy)$ exists and is real for Lebesgue almost every real x. In
particular, (8) shows that

$$\text{if } G \text{ is inner on } \mathbb{H}^+ \text{ with } G(i) = i, \text{ then } G(C) \stackrel{d}{=} C. \tag{9}$$

As indicated by E. J. Williams [39] and Kemperman [14], for some inner G on $\mathbb{H}^+$
with G(i) = i, the property $G(C) \stackrel{d}{=} C$ characterizes the distribution of C among
all absolutely continuous distributions on the line. These are the G whose action
on R is ergodic relative to Lebesgue measure. Neuwirth [26] showed that an inner
function G with G(i) = i is ergodic if it is not one to one. In particular,

$$G_u(z) := uz - (1 - u)/z \tag{10}$$

as in (7) is ergodic. The above transformation from (3) to (7) amounts to the
semi-conjugacy relation

$$Q_u \circ \gamma = \gamma \circ G_u \quad \text{where } \gamma(w) := 1/(1 + w^2). \tag{11}$$

So $Q_u$ acts ergodically as a measure preserving transformation of (0, 1) equipped
with the arc-sine law. It is easily seen that for $u \in (0, 1)$ a $Q_u$-invariant probability
measure must be concentrated on [0, 1], and Theorem 1 follows.

See also [35, p. 58] for an elementary proof of (7), [1, 23, 24, 2] for further study
of the ergodic theory of inner functions, [16, 19] for related characterizations of the
Cauchy law on R and [17, 9] for extensions to $R^n$.
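
The semi-conjugacy (11) is an algebraic identity and can be checked numerically. The sketch below assumes that the rational map of (4) is written in the form a(1 - a)/(u^2 + (1 - 2u)a), which is what (3) gives on taking u = |a|/(|a| + |c|); the sample points and function names are illustrative.

```python
def Q(u, a):
    """Rational map Q_u(a) = a(1 - a) / (u^2 + (1 - 2u) a), assumed form of (4)."""
    return a * (1.0 - a) / (u * u + (1.0 - 2.0 * u) * a)

def G(u, w):
    """Map (10) restricted to real w != 0: G_u(w) = u w - (1 - u) / w."""
    return u * w - (1.0 - u) / w

def gamma(w):
    """gamma(w) = 1 / (1 + w^2), the map appearing in (6) and (11)."""
    return 1.0 / (1.0 + w * w)

def max_defect(u, points):
    """Largest |Q_u(gamma(w)) - gamma(G_u(w))| over the supplied real points w."""
    return max(abs(Q(u, gamma(w)) - gamma(G(u, w))) for w in points)

points = [k / 7.0 for k in range(-40, 41) if k != 0]
defects = {u: max_defect(u, points) for u in (0.1, 0.3, 0.5, 0.9)}
print(defects)
```

The defects are at the level of floating-point rounding, and for u = 1/2 the map reduces to the logistic map 4a(1 - a) mentioned after Theorem 1.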

3. Further interpretations 

Since $w \mapsto 1/(1 + w^2)$ maps i to $\infty$, another application of Lévy's theorem shows
that the arc-sine law of $1/(1 + C^2)$ is the hitting distribution on [0, 1] of a complex
Brownian motion started at $\infty$ (or uniformly on any circle surrounding [0, 1]).
In terms of classical planar potential theory [32, Theorem 4.12], the arc-sine law is
thus identified as the normalized equilibrium distribution on [0, 1]. The corresponding
characterization of the distribution of 1 - 2A on [-1, 1] appears in Brolin [5], in
connection with the invariance of this distribution under the action of Chebychev
polynomials, as discussed further in the next section. Equivalently by inversion, the
distribution of 1/(1 - 2A) is the hitting distribution on $(-\infty, -1] \cup [1, \infty)$ for complex
Brownian motion started at 0. Spitzer [36] found this hitting distribution, which
he interpreted further as the hitting distribution of $(-\infty, -1] \cup [1, \infty)$ for a Cauchy
process starting at 0. This Cauchy process is obtained from the planar Brownian
motion watched only when it touches the real axis, via a time change by the inverse
local time at 0 of the imaginary part of the Brownian motion. The arc-sine law can
be interpreted similarly as the limit in distribution as $|x| \to \infty$ of the hitting
distribution of [0, 1] for the Cauchy process started at $x \in R$. See also [30] for further
results in this vein.

4. Some generalizations 

We start with some elementary remarks from the perspective of ergodic theory.
Let $\lambda(a) := 1 - 2a$, which maps [0, 1] onto [-1, 1]. Obviously, a Borel measurable
function $f^\sharp$ has the property

$$f^\sharp(A) \stackrel{d}{=} A \tag{12}$$

for A with arc-sine law if and only if

$$\tilde f(1 - 2A) \stackrel{d}{=} 1 - 2A \quad \text{where } \tilde f = \lambda \circ f^\sharp \circ \lambda^{-1}. \tag{13}$$

Let $p(z) := \tfrac{1}{2}(z + z^{-1})$, which projects the unit circle onto [-1, 1]. It is easily seen
from (2) that (13) holds if and only if there is a measurable map f from the circle
to itself such that

$$f(U) \stackrel{d}{=} U \quad \text{and} \quad \tilde f \circ p(u) = p \circ f(u) \text{ for } |u| = 1 \tag{14}$$

where U has uniform distribution on the unit circle. In the terminology of ergodic
theory [27], every transformation $f^\sharp$ of [0, 1] which preserves the arc-sine law is thus
a factor of some non-unique transformation f of the circle which preserves Lebesgue
measure. Moreover, this f can be taken to be symmetric, meaning

$$f(\bar z) = \overline{f(z)}.$$

If f acts ergodically with respect to Lebesgue measure on the circle, then $f^\sharp$ acts
ergodically with respect to Lebesgue measure on [0, 1], hence the arc-sine law is the
unique absolutely continuous $f^\sharp$-invariant measure on [0, 1]. This argument is well
known in case $f(z) = z^d$ for d = 2, 3, ..., when it is obvious that (14) holds and
well known that f is ergodic. Then $\tilde f(x) = T_d(x)$, the dth Chebychev polynomial
[34] and we recover from (14) the well known result ([3], [34, Theorem 4.5]) that

$$T_d(1 - 2A) \stackrel{d}{=} 1 - 2A \quad (d = 1, 2, \ldots). \tag{15}$$
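
The Chebychev case can be checked directly: when the circle map is the d-th power map, the induced map on [-1, 1] in the semi-conjugation (14) is T_d, because T_d(cos t) = cos(dt). The sketch below evaluates T_d by the standard three-term recurrence and measures the largest discrepancy on a grid of points of the unit circle; the grid size and function names are illustrative choices.

```python
import cmath
import math

def chebyshev(d, x):
    """T_d(x) via the recurrence T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x)."""
    t_prev, t_curr = 1.0, x
    if d == 0:
        return t_prev
    for _ in range(d - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

def p(z):
    """p(z) = (z + 1/z) / 2, which equals Re z on the unit circle."""
    return 0.5 * (z + 1.0 / z)

def semiconjugacy_defect(d, n_points=360):
    """Max |T_d(p(z)) - p(z^d)| over n_points equally spaced z with |z| = 1."""
    worst = 0.0
    for k in range(n_points):
        z = cmath.exp(2j * math.pi * k / n_points)
        lhs = chebyshev(d, p(z).real)
        rhs = p(z ** d).real
        worst = max(worst, abs(lhs - rhs))
    return worst

print([semiconjugacy_defect(d) for d in range(1, 8)])
```

The discrepancies are of the order of rounding error, which is the deterministic half of (14); the distributional half is that the power map preserves the uniform law on the circle.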

Let $\mathbb{D} := \{z : |z| < 1\}$ denote the unit disc in the complex plane. An inner
function on $\mathbb{D}$ is a function defined and holomorphic on $\mathbb{D}$, with radial limits of
modulus 1 at Lebesgue almost every point on the unit circle. Let $\phi(z) := i(1 + z)/(1 - z)$
denote the Cayley bijection from $\mathbb{D}$ to the upper half-plane $\mathbb{H}^+$. It is
well known that the inner functions G on $\mathbb{H}^+$, as considered in Section 2, are the
conjugations $G = \phi \circ f \circ \phi^{-1}$ of inner functions f on $\mathbb{D}$. So either by conjugation
of (9), or by application of Lévy's theorem to Brownian motion in $\mathbb{D}$ started at 0,

$$\text{if } f \text{ is inner on } \mathbb{D} \text{ with } f(0) = 0, \text{ then } f(U) \stackrel{d}{=} U \tag{16}$$

where U is uniform on the unit circle. If f is an inner function on $\mathbb{D}$ with a fixed
point in $\mathbb{D}$, and f is not one-to-one, then f acts ergodically on the circle [26]. The
only one-to-one inner functions with f(0) = 0 are f(z) = cz for some c with |c| = 1.
By combining the above remarks, we obtain the following generalization of (15),
which is the particular case $f(z) = z^d$:

Theorem 2. Let f be a symmetric inner function on $\mathbb{D}$ with f(0) = 0. Define the
transformation $\tilde f$ on [-1, 1] via the semi-conjugation

$$\tilde f \circ p(z) = p \circ f(z) \text{ for } |z| = 1, \quad \text{where } p(z) := \tfrac{1}{2}(z + z^{-1}). \tag{17}$$

If A has arc-sine law then

$$\tilde f(1 - 2A) \stackrel{d}{=} 1 - 2A. \tag{18}$$

Except if f(z) = z or f(z) = -z, the arc-sine law is the only absolutely continuous
law of A on [0, 1] with this property.

It is well known that every inner function f which is continuous on the closed
disc is a finite Blaschke product, that is a rational function of the form

$$f(z) = c \prod_{i=1}^{d} \frac{z - a_i}{1 - \bar a_i z} \tag{19}$$

for some complex c and $a_i$ with |c| = 1 and $|a_i| < 1$. Note that f(0) = 0 iff some
$a_i = 0$, and that f is symmetric iff $c = \pm 1$ with some $a_i$ real and the rest of the $a_i$
forming conjugate pairs. In particular, if we take c = 1, $a_1 = 0$, $a_2 = a \in (-1, 1)$,
we find that the degree two Blaschke product

$$f_a(z) := z\, \frac{z - a}{1 - az} = \frac{z - a}{z^{-1} - a}$$

for a = 1 - 2u is the conjugate via the Cayley map $\phi(z) := i(1 + z)/(1 - z)$ of
the function $G_u(w) = uw - (1 - u)/w$ on $\mathbb{H}^+$, which appeared in Section 2. For
$f = f_{1-2u}$ the semi-conjugation (17) is the equivalent via conjugation by $\phi$ of the
semi-conjugation (11). So for instance

$$Q_u \circ \gamma \circ \phi = \gamma \circ \phi \circ f_{1-2u} \quad \text{where } \gamma \circ \phi(z) = \frac{-(1 - z)^2}{4z} \tag{20}$$

so that

$$\gamma \circ \phi(z) = \tfrac{1}{2}(1 - \mathrm{Re}\, z) \quad \text{if } |z| = 1,$$

and Theorem 1 can be read from Theorem 2.
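
The conjugacy between the degree two Blaschke product and the half-plane map of Section 2 can also be verified numerically. The sketch assumes the product is z(z - a)/(1 - az) with a = 1 - 2u and checks that applying the Cayley map after it agrees with applying u w - (1 - u)/w after the Cayley map, on a grid of points of the open unit disc; the grid and the function names are illustrative.

```python
def cayley(z):
    """Cayley map z -> i (1 + z) / (1 - z) from the unit disc to the upper half-plane."""
    return 1j * (1 + z) / (1 - z)

def blaschke(a, z):
    """Degree-two Blaschke product z (z - a) / (1 - a z) for real a in (-1, 1)."""
    return z * (z - a) / (1 - a * z)

def half_plane_map(u, w):
    """G_u(w) = u w - (1 - u) / w on the upper half-plane."""
    return u * w - (1 - u) / w

def conjugacy_defect(u, points):
    """Max |cayley(blaschke(1 - 2u, z)) - G_u(cayley(z))| over the supplied points z."""
    a = 1 - 2 * u
    return max(abs(cayley(blaschke(a, z)) - half_plane_map(u, cayley(z)))
               for z in points)

# Grid of points with |z| <= 0.6 * sqrt(2) < 1, so all lie in the open disc.
points = [complex(0.1 * x, 0.1 * y) for x in range(-6, 7) for y in range(-6, 7)]
print({u: conjugacy_defect(u, points) for u in (0.1, 0.5, 0.9)})
```

For u = 1/2 the Blaschke product is the squaring map, which matches the Bernoulli-shift description of the logistic case given in the Introduction.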

Consider now a rational function R as a mapping from $\bar{C}$ to $\bar{C}$ where $\bar{C}$ is the
Riemann sphere. A subset K of $\bar{C}$ is completely R-invariant if K is both forward
and backward invariant under R: for $z \in \bar{C}$, $z \in K \Leftrightarrow R(z) \in K$. Beardon [4,
Theorem 1.4.1] showed that for R a polynomial of degree $d \ge 2$, the interval [-1, 1]
is completely R-invariant iff R is $T_d$ or $-T_d$. A similar argument yields

Proposition 3. Let f be a symmetric finite Blaschke product of degree d. Then
there exists a unique rational function $\tilde f$ which solves the functional equation

$$\tilde f \circ p(z) = p \circ f(z) \text{ for } z \in \bar{C}, \quad \text{where } p(z) := \tfrac{1}{2}(z + z^{-1}). \tag{21}$$

This $\tilde f$ has degree d, and [-1, 1] is completely $\tilde f$-invariant. Conversely, if [-1, 1] is
completely R-invariant for a rational function R, then $R = \tilde f$ for some such f.

Proof. Note that p maps the circle with $\pm 1$ removed in a two to one fashion to
(-1, 1), while p fixes $\pm 1$, and maps each of $\mathbb{D}$ and $\mathbb{D}^* := \{z : |z| > 1\}$ bijectively
onto $[-1, 1]^c := \bar{C} \setminus [-1, 1]$. Let us choose to regard

$$p^{-1}(w) = w + i \sqrt{1 - w^2}$$

as mapping $[-1, 1]^c$ to $\mathbb{D}$. Then $\tilde f := p \circ f \circ p^{-1}$ is a well defined mapping of $[-1, 1]^c$
to itself. Because f is continuous and symmetric on the unit circle, this $\tilde f$ has a
continuous extension to $\bar{C}$, which maps [-1, 1] to itself. So $\tilde f$ is continuous from $\bar{C}$
to $\bar{C}$, and holomorphic on $[-1, 1]^c$. It follows that $\tilde f$ is holomorphic from $\bar{C}$ to $\bar{C}$,
hence $\tilde f$ is rational. Clearly, $\tilde f$ leaves [-1, 1] completely invariant.

Conversely, if [-1, 1] is completely R-invariant for a rational function R, then
we can define $f := p^{-1} \circ R \circ p$ as a holomorphic map $\mathbb{D}$ to $\mathbb{D}$. Because R preserves
[-1, 1] we find that f is continuous and symmetric on the boundary of $\mathbb{D}$. Hence f
is a Blaschke product, which must be symmetric also on $\mathbb{D}$ by the Cauchy integral
representation of f. □

As a check, Proposition 3 combines with Theorem 2 to yield the special case
K = [-1, 1] of the following result:

Theorem 4. (Lalley [18]) Let K be a compact non-polar subset of C, and suppose
that K is completely R-invariant for a rational mapping R with $R(\infty) = \infty$. Then
the equilibrium distribution on K is R-invariant.

Proof. Lalley gave this result for K = J(R), the Julia set of a rational mapping R,
as defined in any of [5, 22, 4, 18], assuming that $R(\infty) = \infty \notin J(R)$. Then K is
necessarily compact, non-polar, and completely R-invariant. His argument, which
we now recall briefly, shows that these properties of K are all that is required
for the conclusion. The argument is based on the fact [32, Theorem 4.12] that
the normalized equilibrium distribution on K is the hitting distribution of K for a
Brownian motion Z on $\bar{C}$ started at $\infty$. Stop Z at the first time $\tau$ that it hits K. By
Lévy's theorem, and the complete R-invariance of K, the path $\{R(Z_t), 0 \le t \le \tau\}$
has (up to a time change) the same law as does $\{Z_t, 0 \le t \le \tau\}$. So the distribution
of the endpoint $Z_\tau$ is R-invariant. □

According to a well known result of Fatou [22, p. 57], the Julia set of a Blaschke 
product $f$ is either the unit circle or a Cantor subset of the circle. According to 
Hamilton [11, p. 281], the former case obtains iff the action of $f$ on the circle is 
ergodic relative to Lebesgue measure. Hamilton [12, p. 88] states that a rational 
map $R$ has $[-1,1]$ as its Julia set iff $R$ is of the form $\tilde f$ described in Proposition 3 for 
some symmetric and ergodic Blaschke product $f$. In particular, for the Chebychev 
polynomial $T_d$ it is known [4] that $J(T_d) = [-1,1]$ for all $d \ge 2$, and [25, Theorem 
4.3 (ii)] that $J(Q_u) = [0,1]$ for all $0 < u < 1$. Typically of course, the Julia set of a 
rational function is very much more complicated than an interval or smooth curve 
[22, 4, 8]. 

Returning to consideration of the arc-sine law, it can be shown by elementary 
arguments that if $Q$ preserves the arc-sine law on $[0,1]$ and $Q(a) = P_2(a)/P_1(a)$ 
with $P_i$ a polynomial of degree $i$, then $Q = Q_u$ or $1 - Q_u$ for some $u \in [0,1]$. This 
and all preceding results are consistent with the following: 

Conjecture 5. Every rational function $R$ which preserves the arc-sine law on $[0,1]$ 
is of the form $R(a) = \tfrac12(1 - \tilde f(1 - 2a))$ where $\tilde f$ is derived from a symmetric Blaschke 
product $f$ with $f(0) = 0$, as in Theorem 2. 


5. Some integral identities 

Let $(B_t, t \ge 0)$ denote a standard one-dimensional Brownian motion. Let 

$$\varphi(z) := \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}; \qquad \bar\Phi(x) := \int_x^\infty \varphi(z)\, dz = P(B_1 > x).$$

According to formula (13) of [29], the following identity gives two different expressions 
for the conditional probability density $P(B_U \in dx \mid B_1 = b)/dx$ for $U$ with 
uniform distribution on $[0,1]$, assumed independent of $(B_t, t \ge 0)$: 

$$\int_0^1 \frac{1}{\sqrt{u(1-u)}}\, \varphi\!\left(\frac{x - bu}{\sqrt{u(1-u)}}\right) du = \frac{\bar\Phi(|x| + |b - x|)}{\varphi(b)}. \tag{22}$$

The first expression reflects the fact that $B_u$ given $B_1 = b$ has normal distribution 
with mean $bu$ and variance $u(1-u)$, while the second was derived in [29] by consideration 
of Brownian local times. Multiply both sides of (22) by $\sqrt{2/\pi}$ to obtain 
the following identity for $A$ with the arc-sine law (1): for all real $x$ and $b$ 

$$E\left[\exp\left(-\frac{(x - bA)^2}{2A(1-A)}\right)\right] = 2 e^{b^2/2}\, \bar\Phi(|x| + |b - x|). \tag{23}$$

Now 

$$\frac{(x - bA)^2}{A(1-A)} = \frac{x^2}{A} + \frac{(x-b)^2}{1-A} - b^2 \stackrel{d}{=} \frac{(|x| + |b - x|)^2}{A} - b^2 \tag{24}$$

where the equality in distribution is a restatement of (3). So (23) amounts to the 
identity 

$$E\left[\exp\left(-\frac12\left(\frac{x^2}{A} + \frac{y^2}{1-A}\right)\right)\right] = 2\bar\Phi(|x| + |y|) \tag{25}$$

for arbitrary real $x$, $y$. Moreover, the identity in distribution (3) allows (25) to be 
deduced from its special case $y = 0$, that is 

$$E\left[\exp\left(-\frac{x^2}{2A}\right)\right] = 2\bar\Phi(|x|), \tag{26}$$

which can be checked in many ways. For instance, $P(1/A \in dt) = \dfrac{dt}{\pi t \sqrt{t-1}}$ 
for $t > 1$, so (26) reduces to the known Laplace transform [10, 3.363] 

$$\frac{1}{\pi} \int_1^\infty \frac{e^{-\lambda u}}{u\sqrt{u-1}}\, du = 2\bar\Phi(\sqrt{2\lambda}). \tag{27}$$

This is verified by observing that both sides vanish at $\lambda = \infty$ and have the same 
derivative with respect to $\lambda$ at each $\lambda > 0$. Alternatively, (26) can be checked as 
follows, using the Cauchy representation (6). Assuming that $C$ is independent of 
$B_1$, we can compute for $x > 0$ 

$$E\left[\exp\left(-\frac{x^2}{2A}\right)\right] = e^{-\frac12 x^2}\, E[\exp(i x C B_1)] = e^{-\frac12 x^2}\, E[\exp(-x|B_1|)] = 2\bar\Phi(x). \tag{28}$$

We note also that the above argument allows (24) and hence (3) to be deduced 
from (23) and (26), by uniqueness of Laplace transforms. 
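
Identities of the type (25) and (26) lend themselves to a quick numerical check. The sketch below assumes the readings $E[\exp(-\tfrac12(x^2/A + y^2/(1-A)))] = 2P(B_1 > |x| + |y|)$ for (25), with (26) as the case $y = 0$, and uses the representation $A = \sin^2\Theta$ with $\Theta$ uniform on $(0, \pi/2)$ for the arc-sine law. The midpoint rule and the function names are illustrative choices, not part of the paper.

```python
import math

def arcsine_expectation(g, n=20000):
    """Approximate E[g(A)] for arc-sine A via A = sin^2(theta), theta uniform on (0, pi/2)."""
    h = (math.pi / 2) / n
    total = 0.0
    for j in range(n):
        theta = (j + 0.5) * h          # midpoints avoid A = 0 and A = 1 exactly
        total += g(math.sin(theta) ** 2)
    return total / n

def upper_tail(x):
    """P(B_1 > x) for a standard normal variable B_1."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def lhs_25(x, y):
    """Left side of (25) by quadrature."""
    return arcsine_expectation(
        lambda a: math.exp(-0.5 * (x * x / a + y * y / (1 - a)))
    )

def rhs_25(x, y):
    """Right side of (25)."""
    return 2 * upper_tail(abs(x) + abs(y))
```

Comparing `lhs_25(x, 0.0)` with `rhs_25(x, 0.0)` checks (26), and nonzero `y` checks (25); agreement to quadrature accuracy is what the identities predict.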

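By differentiation with respect to $x$, we see that (25) is equivalent to 

$$E\left[\frac{x}{A} \exp\left(-\frac12\left(\frac{x^2}{A} + \frac{y^2}{1-A}\right)\right)\right] = \sqrt{\frac{2}{\pi}}\, \exp\left(-\tfrac12 (x+y)^2\right) \qquad (x > 0,\ y \ge 0). \tag{29}$$

That is to say, for each $x > 0$ and $y \ge 0$ the following function of $u \in (0,1)$ defines 
a probability density on $(0,1)$: 

$$f_{x,y}(u) := \frac{x}{\sqrt{2\pi u^3 (1-u)}}\, \exp\left(\tfrac12 (x+y)^2 - \frac12\left(\frac{x^2}{u} + \frac{y^2}{1-u}\right)\right). \tag{30}$$

This was shown by Seshadri [35, p. 123], who observed that $f_{x,y}$ is the density of 
$T_{x,y}/(1 + T_{x,y})$ for $T_{x,y}$ with the inverse Gaussian density of the hitting time of $x$ by 
a Brownian motion with drift $y$. In particular, $f_{x,0}$ is the density of $x^2/(x^2 + B_1^2)$. 
See also [29, (17)] regarding other appearances of the density $f_{x,0}$. 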


6. Complements 

The basic identity (3) can be transformed and checked in another way as follows. 
By uniqueness of Mellin transforms, (3) is equivalent to 

$$\frac{u^2}{A\varepsilon_2} + \frac{(1-u)^2}{(1-A)\varepsilon_2} \stackrel{d}{=} \frac{1}{A\varepsilon_2} \tag{31}$$

where $\varepsilon_2$ is an exponential variable with mean 2, assumed independent of $A$. But 
it is elementary and well known that $A\varepsilon_2$ and $(1-A)\varepsilon_2$ are independent with the 
same distribution as $B_1^2$. So (31) amounts to 

$$\frac{u^2}{X^2} + \frac{(1-u)^2}{Y^2} \stackrel{d}{=} \frac{1}{X^2} \tag{32}$$

where $X$ and $Y$ are independent standard Gaussian. But this is the well known 
result of Lévy [20] that the distribution of $1/X^2$ is stable with index $\tfrac12$. The same 
argument yields the following multivariate form of (3): if $(W_1, \ldots, W_n)$ is uniformly 
distributed on the surface of the unit sphere in $\mathbb{R}^n$, then for $a_i \ge 0$ 

$$\sum_{i=1}^n \frac{a_i^2}{W_i^2} \stackrel{d}{=} \frac{\left(\sum_{i=1}^n a_i\right)^2}{W_1^2}. \tag{33}$$

This was established by induction in [6, Proposition 3.1]. The identity (32) can be 
recast as 

$$\frac{X^2 Y^2}{a^2 X^2 + c^2 Y^2} \stackrel{d}{=} \frac{X^2}{(a+c)^2}. \tag{34}$$

This is the identity of first components in the following bivariate identity in distribution, 
which was derived by M. Mora using the property (7) of the Cauchy 
distribution: for $p > 0$ 

$$\left(\frac{X^2 Y^2 (1+p)^2}{X^2 + p^2 Y^2},\ \frac{(X^2 - pY^2)^2}{X^2 + p^2 Y^2}\right) \stackrel{d}{=} (X^2, Y^2). \tag{35}$$

See Seshadri [35, §2.4, Theorem 2.3] regarding this identity and related properties 
of the inverse Gaussian distribution of the hitting time of $a > 0$ by a Brownian 
motion with positive drift. Given $(X^2, Y^2)$, the signs of $X$ and $Y$ are chosen as if 
by two independent fair coin tosses, so (34) is further equivalent to 

$$\frac{XY}{\sqrt{a^2 X^2 + c^2 Y^2}} \stackrel{d}{=} \frac{X}{a+c} \qquad (a, c > 0). \tag{36}$$


As a variation of (26), set $x = \sqrt{2\lambda}$ and make the change of variable $z = \sqrt{2\lambda u}$ 
in the integral to deduce the following curious identity: if $X$ is a standard Gaussian 
then 

$$E\left(\frac{x}{X\sqrt{X^2 - x^2}} \,\Big|\, X > x\right) = \sqrt{\frac{\pi}{2}} \qquad (x > 0). \tag{37}$$

As a check, (37) for large $x$ is consistent with the elementary fact that the distribution 
of $(x(X - x) \mid X > x)$ approaches that of a standard exponential variable $\varepsilon_1$ 
as $x \to \infty$. The distribution of $(x/(X\sqrt{X^2 - x^2}) \mid X > x)$ therefore approaches 
that of $1/\sqrt{2\varepsilon_1}$ as $x \to \infty$, and $E(1/\sqrt{2\varepsilon_1}) = \sqrt{\pi/2}$. 

By integration with respect to $h(x)\,dx$, formula (37) is equivalent to the following 
identity: for all non-negative measurable functions $h$ 

$$E\left[\int_0^X \frac{x\, h(x)\, dx}{X\sqrt{X^2 - x^2}}\ 1(X > 0)\right] = \sqrt{\frac{\pi}{2}}\ E\left[\int_0^X h(x)\, dx\ 1(X > 0)\right].$$

That is to say, for $U$ with uniform $(0,1)$ distribution, assumed independent of $X$, 

$$\sqrt{\frac{2}{\pi}}\ E\left[h\left(\sqrt{1 - U^2}\, |X|\right)\right] = E\left[|X|\, h(|X| U)\right].$$

Equivalently, for arbitrary non-negative measurable $g$ 

$$E\left[g\left((1 - U^2) X^2\right)\right] = \sqrt{\frac{\pi}{2}}\ E\left[|X|\, g(X^2 U^2)\right]. \tag{38}$$

Now $X^2 \stackrel{d}{=} A\varepsilon_2$ where $\varepsilon_2$ is exponential with mean 2, independent of $A$; and when 
the density of $X^2$ is changed by a factor of $\sqrt{\pi/2}\,|X|$ we get back the density of $\varepsilon_2$. 
So the identity (38) reduces to 

$$(1 - U^2) A \varepsilon_2 \stackrel{d}{=} U^2 \varepsilon_2$$

and hence to 

$$(1 - U^2) A \stackrel{d}{=} U^2.$$

This is the particular case $a = b = c = 1/2$ of the well known identity 

$$\beta_{a+b,c}\, \beta_{a,b} \stackrel{d}{=} \beta_{a,b+c}$$

for $a, b, c > 0$, where $\beta_{p,q}$ denotes a random variable with the beta$(p,q)$ distribution 
on $(0,1)$ with density at $u$ proportional to $u^{p-1}(1-u)^{q-1}$, and it is assumed that 
$\beta_{a+b,c}$ and $\beta_{a,b}$ are independent. 
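
Because beta variables are bounded, a product identity of the form $\beta_{a+b,c}\,\beta_{a,b} \stackrel{d}{=} \beta_{a,b+c}$ (with independent factors on the left) is equivalent to equality of all moments. The following sketch checks this with exact rational arithmetic, using the standard moment formula $E[\beta_{p,q}^k] = \prod_{j=0}^{k-1} (p+j)/(p+q+j)$; the function names are illustrative.

```python
from fractions import Fraction

def beta_moment(p, q, k):
    """k-th moment of a beta(p, q) variable: prod_{j<k} (p + j) / (p + q + j)."""
    result = Fraction(1)
    for j in range(k):
        result *= Fraction(p + j) / Fraction(p + q + j)
    return result

def product_identity_holds(a, b, c, max_order=12):
    """Compare moments of beta(a+b, c) * beta(a, b) with those of beta(a, b+c)."""
    return all(
        beta_moment(a + b, c, k) * beta_moment(a, b, k) == beta_moment(a, b + c, k)
        for k in range(max_order + 1)
    )

half = Fraction(1, 2)
# a = b = c = 1/2: beta(1, 1/2) is the law of 1 - U^2, beta(1/2, 1/2) the arc-sine law,
# and beta(1/2, 1) the law of U^2 for U uniform on (0, 1).
arc_sine_case = product_identity_holds(half, half, half)
```

The moments match because the factors $(a+b+j)$ cancel between the two products, which is the whole content of the identity at the level of moments.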
to Brownian motion 
Abstract: A short variation of the original proof of Dubins and Schwarz of 
their result, that all continuous martingales can be time changed to Brownian 
motion, is given. 


It would be hard to overstate my debt to Herman Rubin, whose office is just down 
the hall. My discussions with him range from twenty minutes on, say, infinitely 
divisible processes or the Fisk-Stratonovich integral (Fisk was Herman's student), 
which completely changed the way I understood subjects I thought I knew, to 
countless small enlightenments, to providing an instant solution to a homework 
problem, due that day, that I realized, right before class, was a lot harder than 
I thought. 

This paper provides a short variant of the original proof of the result of Dubins 
and Schwarz [2] that continuous martingales with unbounded paths can be time 
changed to standard Brownian motion. See [3] for a discussion of this theorem. We 
first consider the case that the paths of $M$ are not constant on any open interval, 
and then discuss the general case. The embedding scheme used here was also used 
in [2]. The novelty is the use of the lemma below. 

Theorem. Let $M_t$, $t \ge 0$, be a continuous martingale satisfying $M_0 = 0$, 
$\sup_t |M_t| = \infty$, and $P(M_s = M_a,\ a < s < b) = 0$ for all $0 < a < b$. Then 
there are stopping times $\eta_t$, $t \ge 0$, which strictly and continuously increase from 0 
to infinity, such that $M_{\eta_t}$, $t \ge 0$, is Brownian motion. 

Proof. Let $u_0^M = 0$, and $u_{k+1}^M = \inf\{t > u_k^M : |M_t - M_{u_k^M}| = 1\}$, $k \ge 0$, and let 
$v_{n,j}^M = u_j^{2^n M}$, if $n, j \ge 0$. We drop the superscript $M$ for the rest of this paragraph. 
Then $M_{u_j}$, $j \ge 0$, is a fair random walk, and $M_{v_{n,j}}$, $j \ge 0$, has the distribution of 
a fair random walk divided by $2^n$. Of course the distribution of the $v_{n,j}$ is different 
for different martingales, but the distribution of the ordering of these times is not. 
To be precise, the probability of any event in the algebra of events generated by the 
events $\{v_{i,j} < v_{k,l}\}$ is the same for all martingales $M$. To see this, it 
helps to first check that $P(v_{1,3} < v_{0,1}) = 1/2$, since the random walk $M_{v_{0,j}}$, $j \ge 0$, 
is embedded in the random walk $M_{v_{1,j}}$, $j \ge 0$, by the discrete analog of the times 
$v_{0,j}$, and the probability of the analogous event for these walks is 1/2. Now since 
the walks $M_{v_{k,j}}$, $j \ge 0$, can for $0 \le k < n$ all be embedded in the walk $M_{v_{n,j}}$, 
$j \ge 0$, which is of course the same walk for any $M$, the probability of an event in 
the algebra is the probability of an event for discrete random walk. □ 
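
The "first check" in this paragraph is a statement about a discrete random walk, so it can be verified by enumeration. The sketch below rests on one reading of the garbled notation, which is an assumption: $v_{1,j}$ is the $j$-th time the martingale moves by $1/2$, $v_{0,1}$ is the first time it reaches $\pm 1$, and the event checked is $\{v_{1,3} < v_{0,1}\}$, that is, the half-step walk has not reached $\pm 1$ within its first three steps. Positions are counted in half-units, so level $\pm 1$ is $\pm 2$.

```python
from fractions import Fraction
from itertools import product

def level_one_hit_step(steps):
    """Index of the first step at which the half-step walk sits at +-2 half-units, else None."""
    position = 0
    for index, step in enumerate(steps, start=1):
        position += step
        if abs(position) == 2:
            return index
    return None

# All 2^3 equally likely sign patterns for the first three half-steps.
paths = list(product((-1, 1), repeat=3))
not_hit_within_three = [p for p in paths if level_one_hit_step(p) is None]
probability = Fraction(len(not_hit_within_three), len(paths))
```

Under this reading the walk reaches $\pm 1$ at its second half-step exactly when the first two steps agree, which has probability 1/2, and it cannot reach $\pm 1$ at an odd step, so the enumeration returns 1/2 regardless of which martingale the walk was embedded in.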


¹Statistics Department, Purdue University, West Lafayette, IN 47906, USA. e-mail: 
bdavis@stat.purdue.edu 
Lemma. For $0 \le n < \infty$, let $t_{n,j}$, $0 \le j < \infty$, be a sequence, and suppose 

(i) $0 = t_{n,0}$, $n \ge 0$, 

(ii) $t_{n,j} < t_{n,j+1}$, $j \ge 0$, 

(iii) for all $j$ and $n$, $t_{n,j}$ is one of the numbers $t_{n+1,k}$, 

(iv) the set of all the $t_{n,j}$ is dense in $[0, \infty)$. 

Then a sequence $a_n$, $n \ge 0$, of nonnegative numbers converges if and only if given 
$m$ there is a $j$ such that $t_{m,j} < a_k < t_{m,j+2}$ for all large enough $k$. Furthermore, if 
$K$ is a positive integer, an increasing nonnegative function $f$ on $[0, K]$ is continuous 
if and only if given $n \ge 0$ there is $m$ such that for each $i$, $0 \le i < Km$, there is 
$j = j(i)$ such that $t_{n,j} < f(i/m)$ and $f((i+1)/m) < t_{n,j+2}$. 

This lemma is obvious. Now let $v^M_{n,2^{2n}}$ play the role of $a_n$ and $v^M_{n,j}$ have the role 
of $t_{n,j}$ in this lemma. The conditions (i)-(iv) are easy to check, using the absence 
of flat spots for (iv). The lemma implies that whether or not $v^M_{n,2^{2n}}$ converges (a.s.) 
depends only on the distribution of the order of the $v^M_{i,j}$. Since this latter distribution 
does not depend on $M$, we have either convergence for all $M$ or no $M$. But if $M$ is a 
Brownian motion $B$, we do have convergence, to 1. For, following Skorohod (see [1]), 
$v^B_{n,2^{2n}}$ has the distribution of the average of $2^{2n}$ iid random variables each having 
the distribution of $u_1^B := u$. Since $Eu = 1$ and since the variance of $u$ is finite, 
easily shown upon noting that $P(u > k+1 \mid u > k) \le P(|Z| < 2)$, $k \ge 0$, where $Z$ 
is standard normal, Chebyshev's inequality gives this convergence to 1. Similarly 
$\lim_{n\to\infty} v^M_{n,[2^{2n}t]} := \eta^M_t$ exists, where $[\ ]$ denotes the greatest integer function. Now 
the distribution of $M_{\eta^M_1}$ is the limit of the distributions of $M_{v^M_{n,2^{2n}}}$, since $M$ has 
continuous paths, and thus is the same for all martingales $M$, and this limit can 
be identified, by taking $M = B$, as standard normal. All the joint distributions 
can be similarly treated, and so $M_{\eta^M_t}$, $t \ge 0$, is Brownian motion. This implies that $\eta^M_t$ is 
strictly increasing. To see that it is continuous, use the last sentence of the lemma. 
An argument like that just given shows that continuity on $[0, K]$ for any $K$, and 
thus continuity on $[0, \infty)$, either holds for all or no $M$. And $\eta^B_t = t$. Finally, since 
$\eta^M_t$ is continuous and strictly increasing, it can be written as a supremum over $k > 0$ 
of limits as $n \to \infty$ of the times $v^M_{n,j}$, 
so $\eta^M_t$ is a stopping time. □ 

In case the paths of $M$ have flat spots, remove them. Let $A$ stand for the union 
of the open intervals on which $M$ is constant. Let $h(t) = \inf\{y : |(0,y) \cap A^c| = t\}$, 
$t \ge 0$, where $|\cdot|$ is Lebesgue measure and the $c$ denotes complement, so 
that if we define $N_t = M_{h(t)}$, $N$ is continuous with no flat spots. Whether or 
not $N$ is a martingale, random walks can be embedded in it, since they can be 
embedded in $M$. Thus just as above, $N_{\eta^N_t}$ is Brownian motion. Put $\mu_t = h(\eta^N_t)$. Then 
$\mu$ is left continuous and strictly increasing, and $M_{\mu_t}$ is Brownian motion. And 
$\mu_t$, being a supremum over $k > 0$ of a limsup as $n \to \infty$ of embedding times, is a stopping time. 
Abstract: A sequence of independent random variables $\{X_1, X_2, \ldots\}$ is called 
a $B$-harmonic Bernoulli sequence if $P(X_i = 1) = 1 - P(X_i = 0) = 1/(i + B)$, 
$i = 1, 2, \ldots$, with $B \ge 0$. For $k \ge 1$, the count variable $Z_k$ is the number 
of occurrences of the $k$-string $(1, \underbrace{0, \ldots, 0}_{k-1}, 1)$ in the Bernoulli sequence. This 
paper gives the joint distribution of the count vector $\mathbf{Z} = (Z_1, Z_2, \ldots)$ of 
strings of all lengths in a $B$-harmonic Bernoulli sequence. This distribution 
can be described as follows. There is a random variable $V$ with a Beta$(B, 1)$ 
distribution, and given $V = v$, the conditional distribution of $\mathbf{Z}$ is that of 
independent Poissons with intensities $(1 - v), (1 - v^2)/2, (1 - v^3)/3, \ldots$. 

Around 1996, Persi Diaconis stated and proved that when $B = 0$, the 
distribution of $Z_1$ is Poisson with intensity 1. Emery gave an alternative proof a 
few months later. For the case $B = 0$, it was also recognized that $Z_1, Z_2, \ldots, Z_n$ 
are independent Poissons with intensities $1, \tfrac12, \ldots, \tfrac1n$. Proofs up until this time 
made use of hard combinatorial techniques. A few years later, Joffe et al. 
obtained the marginal distribution of $Z_1$ as a Beta-Poisson mixture when $B \ge 0$. 
Their proof recognizes an underlying inhomogeneous Markov chain and uses 
moment generating functions. 

In this note, we give a compact expression for the joint factorial moment of 
$(Z_1, \ldots, Z_N)$ which leads to the joint distribution given above. One might 
feel that if $Z_1$ is large, it will exhaust the number of 1's in the Bernoulli 
sequence $(X_1, X_2, \ldots)$ and this in turn would favor smaller values for $Z_2$ and 
introduce some negative dependence. We show that, on the contrary, the joint 
distribution of $\mathbf{Z}$ is positively associated or possesses the FKG property. 


1. Introduction and summary 

Let $\{X_i : i \ge 1\}$ be a sequence of independent Bernoulli random variables with 
success probabilities $p_i = P(X_i = 1) = 1 - P(X_i = 0)$ for $i \ge 1$. For integers $k \ge 1$, 
the sequence $(1, \underbrace{0, \ldots, 0}_{k-1}, 1)$ will be called a $k$-string. Such a $k$-string represents a 
wait of length $k$ for an “event” to happen since the last time it happened, or a run 
of length $k - 1$ of “non-events.” Let $Z_k$ be the count (which may possibly be infinite) 
of such $k$-strings in the Bernoulli sequence $\{X_1, X_2, \ldots\}$. This paper is concerned 
with the joint distribution of the count vector $\mathbf{Z} \stackrel{\text{def}}{=} (Z_1, Z_2, \ldots)$ of all $k$-strings. 
Such problems appear in many areas such as random permutations, rank orders, 
genetics, abundance of species, etc. 

Let $Y_{i,k}$ be the indicator variable that a $k$-string has occurred at time $i$, 

$$Y_{i,k} = X_i \prod_{j=i+1}^{i+k-1} (1 - X_j)\, X_{i+k} = I\big((X_i, X_{i+1}, \ldots, X_{i+k}) = (1, \underbrace{0, \ldots, 0}_{k-1}, 1)\big) \tag{1}$$

for $i \ge 1$, $k \ge 1$, where as usual, an empty product is defined to be equal to 1. A 
simple expression for $Z_k$ is then given by 

$$Z_k = \sum_{i=1}^{\infty} Y_{i,k} \quad \text{for } k \ge 1. \tag{2}$$

While $Z_k$ is not a sum of independent summands, it can be easily expressed as the 
sum of $k$ series, each of which has independent summands. From this observation 
and Kolmogorov's three-series theorem we can state the following remark which 
gives a necessary and sufficient condition that the random variable $Z_k$ be finite a.s. 

Remark 1. The count random variable $Z_k$ of $k$-strings is finite a.s. if and only if 
$E[Z_k] = \sum_{i \ge 1} p_i \prod_{j=1}^{k-1} (1 - p_{i+j})\, p_{i+k} < \infty$. 

In this paper, we will concentrate exclusively on independent Bernoulli sequences, 
with a particular type of “harmonic” sequence for $\{p_i\}$, which allows for 
explicit computations and also, in some cases, connects the count vector $(Z_1, Z_2, \ldots)$ 
with the study of rank order statistics and random permutations. In fact, we will 
assume that $\{p_i\}$ satisfies 

$$p_i (1 - p_{i+1}) = p_{i+1} \quad \text{or equivalently} \quad p_i - p_{i+1} = p_i p_{i+1} \quad \text{for } i \ge 1. \tag{3}$$

We will avoid the case $p_1 = 0$, since then the only solution to (3) is the trivial 
solution $p_i \equiv 0$. We will therefore assume, for the rest of this paper, that $p_1 = 
1/(1 + B)$, with $B \ge 0$, so that from (3) 

$$p_i = \frac{1}{i + B} \quad \text{for } i \ge 1. \tag{4}$$

We will refer to an independent Bernoulli sequence with $\{p_i\}$ given in (4) as a 
$B$-harmonic Bernoulli sequence. Occasionally, when we wish to emphasize the 
dependence on $B$, we will write $Z_{k,B}$ for the count variable $Z_k$, and $\mathbf{Z}_B$ for the 
count vector $\mathbf{Z}$. From Remark 1, 

$$E[Z_{k,B}] \le \sum_{i \ge 1} p_i\, p_{i+k} = \sum_{i \ge 1} \frac{1}{(i + B)(i + k + B)} < \infty$$

and thus $Z_{k,B}$ is finite, for all $k \ge 1$, a.s. 
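
For harmonic success probabilities $p_i = 1/(i+B)$ the mean of the string indicator can be computed exactly: by independence $E[Y_{i,k}] = p_i \prod_{j=1}^{k-1}(1 - p_{i+j})\, p_{i+k}$, and the product telescopes to $1/((i+k-1+B)(i+k+B))$, so the partial sums of $E[Z_k]$ telescope to $1/(k+B) - 1/(n+k+B)$. The sketch below checks both claims in exact rational arithmetic; the function names and test values are illustrative.

```python
from fractions import Fraction

def p(i, B):
    """Success probability p_i = 1 / (i + B) of a B-harmonic Bernoulli sequence."""
    return Fraction(1) / (i + B)

def mean_Y(i, k, B):
    """E[Y_{i,k}] = p_i * prod_{j=1}^{k-1} (1 - p_{i+j}) * p_{i+k}, using independence."""
    value = p(i, B) * p(i + k, B)
    for j in range(1, k):
        value *= 1 - p(i + j, B)
    return value

def partial_mean_Z(k, B, n):
    """Sum of E[Y_{i,k}] over i = 1, ..., n, a partial sum of E[Z_k]."""
    return sum(mean_Y(i, k, B) for i in range(1, n + 1))
```

Letting $n \to \infty$ in the telescoped partial sum gives $E[Z_{k,B}] = 1/(k+B)$, which for $B = 0$ is the Poisson intensity $1/k$ quoted in the abstract.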

When the counts $(Z_1, Z_2, \ldots)$ are almost-surely finite, their joint distribution 
becomes an object of interest, especially its dependence on the sequence of probabilities 
$\{p_i\}$. Around 1996, Persi Diaconis observed that, for 0-harmonic Bernoulli 
sequences, the distribution of the count variable $Z_1$ is Poisson with intensity 1. 
A few months later [Emery (1996)] gave another proof in an unpublished manuscript. 
It is known that the count vector $(Z_1, \ldots, Z_k)$ of a 0-harmonic Bernoulli 
sequence can be thought of as the limit of the vector $(C_1(n), \ldots, C_k(n))$ of numbers 
of cycles of different orders among permutations of $\{1, 2, \ldots, n\}$. (More details 
are given in the next section.) This fact coupled with the classical results (see 
[Arratia et al. (2003)], [Arratia (1992)]) establishes that the joint distribution of the 
count vector $(Z_1, Z_2, \ldots, Z_k)$, from a 0-harmonic Bernoulli sequence, is that of 
independent Poissons with intensities $(1, \tfrac12, \ldots, \tfrac1k)$, respectively. All the proofs 
mentioned are based on combinatorial methods. 

[Joffe et al. (2002)] considered general $B$-harmonic Bernoulli sequences and obtained 
the moment generating function of $Z_{1,B}$ by noticing that $\{(S_i, X_{i+1}), i = 
1, 2, \ldots\}$ forms an inhomogeneous Markov chain, where $S_i = \sum_{m=1}^{i} X_m X_{m+1}$. From 
this they identified the distribution of $Z_1$ as a Generalized Hypergeometric Factorial 
(GHF) law which is more easily stated as a Beta-mixture of Poisson distributions. 

In this paper we consider general $B$-harmonic Bernoulli sequences and obtain 
the joint distribution $P_B$ of the count vector $\mathbf{Z}_B = (Z_{1,B}, Z_{2,B}, \ldots)$. With the 
addition of another random variable $V$, the joint distribution $Q_B$ of $(V, \mathbf{Z}_B)$ can be 
described as follows: the distribution of $V$ is Beta with parameters $(B, 1)$ and the 
conditional distribution $P_{B,v}$ of $\mathbf{Z}_B$ given $V = v$ is that of independent Poissons 
with intensities $(1 - v), (1 - v^2)/2, (1 - v^3)/3, \ldots$. These results are contained in 
Theorem 2. 

We also compute the covariance of $Z_{k,B}$ and $Z_{m,B}$ for $k < m$ and note that it 
is positive for $B > 0$ in Corollary 2. We also show that $P_B$ has the FKG or the 
positive association property in Theorem 3. There are intuitions for both positive 
and negative correlations between $Z_{k,B}$ and $Z_{m,B}$ and so this result is perhaps 
of interest. A plausible justification for positive correlations arises from the feeling 
that more completed $k$-strings allow one to “start over” more times in the Bernoulli 
sequence and so can lead to more strings of length $m$. However, with the interpretation 
of $Z_{k,B}$ as the number of cycles of length $k$ among random permutations 
of $E_{n,B} = \{1, 2, \ldots, n + B\}$ when $B \ge 0$ is an integer (see the next section), the 
“age-dependent”-cycle count mapping gives perhaps the opposite interpretation. 
Namely, with more $k$-cycles formed, there should be less “room” for $m$-cycles to 
form in $E_{n,B}$, leading to negative association between $Z_{k,B}$ and $Z_{m,B}$. One may 
think, however, for fixed $k < m$ much smaller than $n \uparrow \infty$, that such “boundary” 
considerations are negligible and the first explanation is more reasonable given 
that the mixture distribution is of Beta type which has interpretations with respect 
to “reinforcement” dynamics (e.g. Pólya urns). On the other hand, since the 
asymptotic joint distribution depends on $B$, we know that the “boundary” is not 
completely ignored in the limit, thereby confusing the matter once more. It would 
be of interest to have a better understanding of these dependence issues. 

Our methods avoid the use of combinatorial techniques. We first show, in Lemma 2, 
that factorial powers of count variables $Z_{k,B}$, which are sums of indicator 
variables $Y_{i,k}$ (see (2)), can be expressed as simple sums of products of the $Y_{i,k}$'s. For 
$B$-harmonic Bernoulli sequences, many products of the form $Y_{i,k} Y_{j,k}$ vanish and 
there are some independence properties among the $Y_{i,k}$'s; see (6), (7) and (8). These 
are exploited in Lemma 1, Lemma 2 and Lemma 3 to obtain the joint factorial 
moments of $(Z_{1,B}, \ldots, Z_{n,B})$ in the main theorem (Theorem 1), which is further 
simplified in Theorem 2 by recognizing it as the sum of probabilities of inequalities 
among independent exponential variables. The joint distribution of $(Z_{1,B}, \ldots, Z_{n,B})$ 
can be deduced from this simplified expression for the factorial moments. 

Even though the frequencies of wait times between 1's of all orders are finite a.s., 
it is interesting to note that there are infinitely many 1's in the original Bernoulli 
sequence (since $\sum_{i \ge 1} p_i = \sum_{i \ge 1} 1/(i + B) = \infty$). However, the events (i.e. 1's) 
are so sparse that the wait to the first event has infinite mean when $B > 0$. Let 
$N = \inf\{i \ge 1 : X_i = 1\}$ be the wait to the first event. Then $P(N = k) = 
B/[(k - 1 + B)(k + B)]$ when $B > 0$, and though $P(N < \infty) = 1$ we have $E[N] = \infty$. 
In a similar fashion, when $B = 0$, $X_1 = 1$ a.s. and the wait for the second event 
has infinite expectation. It is also not difficult to see that, no matter the value 
of $B \ge 0$, the number of 1's, $N_n = \sum_{i=1}^n X_i$, satisfies $N_n/\log n \to 1$ a.s., and 
$(N_n - \log n)/\sqrt{\log n} \stackrel{d}{\to} N(0,1)$ (cf. Example 4.6, Ch. 2 [Durrett (1995)]). 
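
The distribution of the wait $N$ to the first 1 follows directly from independence: $P(N = k) = \prod_{i<k}(1 - p_i)\, p_k$ and $P(N > k) = \prod_{i \le k}(1 - p_i)$ with $p_i = 1/(i+B)$. The sketch below evaluates these products exactly and compares them with the closed forms $B/((k-1+B)(k+B))$ and $B/(k+B)$; the second closed form is derived here by telescoping and is not quoted from the paper. Function names are illustrative.

```python
from fractions import Fraction

def first_success_pmf(k, B):
    """P(N = k) = prod_{i<k} (1 - p_i) * p_k with p_i = 1 / (i + B)."""
    prob = Fraction(1) / (k + B)
    for i in range(1, k):
        prob *= 1 - Fraction(1) / (i + B)
    return prob

def tail(k, B):
    """P(N > k) = prod_{i<=k} (1 - p_i)."""
    prob = Fraction(1)
    for i in range(1, k + 1):
        prob *= 1 - Fraction(1) / (i + B)
    return prob
```

Since $E[N] = \sum_{k \ge 0} P(N > k) = \sum_{k \ge 0} B/(k+B)$, the harmonic-type tail makes the infinite mean visible at once, while $P(N > k) \to 0$ gives $P(N < \infty) = 1$.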

Finally, a statistician may ask whether the value of $B$ can be consistently estimated 
from the count vector $\mathbf{Z}$ of all $k$-strings. To say that this can be done is the 
same as saying that $P_B$ and $P_{B'}$ are mutually singular for $B \ne B'$. Let $M_B$ be the 
joint distribution of a $B$-harmonic Bernoulli sequence $\{X_i, i = 1, 2, \ldots\}$. We show 
in Theorem 4, by use of Kakutani's criterion, that $M_B$ and $M_{B'}$ are absolutely 
continuous with respect to each other for $B \ne B'$. This implies the same for $P_B$ 
and $P_{B'}$, and thus $B$ cannot be consistently estimated from $\mathbf{Z}$. 

2. Related areas 

Count vectors of $k$-strings as described above, apart from being objects of intrinsic 
research interest, have concrete interpretations with respect to combinatorics, 
genetics, ecology, statistics, and other areas (cf. [Arratia et al. (2003)], 
[Johnson et al. (1992)], and [Antzoulakos and Chadjiconstantinidis (2001)] and references 
therein). We will describe some connections to rank orders, record values 
and permutations for the case when $B \ge 0$ is an integer. In both situations, there 
is an embedded sequence of independent Bernoulli r.v.'s with respect to which the 
counts of $k$-strings have various interpretations. 

Rank orders and record values. Let $\{\xi_n : n \ge 1\}$ be a sequence of i.i.d. r.v.'s 
with common continuous distribution function $F$. One might think of $\xi_n$ as the 
amount of rainfall or the flood level in the $n$th year. Let $\xi_{1,n} < \xi_{2,n} < \cdots < \xi_{n,n}$ 
be the ordered values of $\{\xi_i : 1 \le i \le n\}$ and define $R_n = j$ if $\xi_n = \xi_{j,n}$. It is a 
well known theorem of Rényi that $\{R_n : n \ge 1\}$ are independent and uniformly 
distributed on their respective ranges (cf. Example 6.2, Ch. 1 [Durrett (1995)]). Let 
$\{a_1, a_2, \ldots\}$ be a sequence of integers such that $1 \le a_n \le n$ and define $X_n = I(R_n = 
a_n)$. The sequence $\{X_n, n \ge 1\}$ is an example of a 0-harmonic Bernoulli sequence, 
for any choice of the sequence $\{a_1, a_2, \ldots\}$. The sequence $\{X_{n,B} = X_{n+B}, n \ge 
1\}$ is an example of a $B$-harmonic Bernoulli sequence when $B \ge 0$ is an 
integer. 

In the special case $a_n = n$ for $n \ge 1$, the event $X_{n,B} = 1$ means that a record, 
with respect to the rainfall amounts in the first $B$ years (which were lost or not 
properly recorded), was set during the year $n + B$. In this case, $Z_{k,B}$ is the number 
of times records were set after a wait of $k - 1$ years from a previous record. 

Of course, by choosing $\{a_n\}$ differently, one can vary the interpretation of $Z_{k,B}$. 

Random permutations. For $B \ge 0$ an integer, let $E_{n,B} = \{1, 2, \ldots, n + B\}$. We 
now describe the “Feller” algorithm which chooses a permutation $\pi : E_{n,B} \to E_{n,B}$ 
uniformly from the $(n + B)!$ possible permutations (cf. Section 4 [Joffe et al. (2002)], 
Chapter 1 of [Arratia et al. (2003)]). 

1. Draw the first element uniformly from $E_{n,B}$ and call it $\pi(1)$. If $\pi(1) = 1$, a 
cycle of length 1 has been completed. If $\pi(1) = j \ne 1$, make a second draw uniformly 
from $E_{n,B} \setminus \{\pi(1)\}$ and call it $\pi(\pi(1)) = \pi(j)$. Continue drawing elements naming 




J. Sethuraman and S. Sethuraman 


them $\pi(\pi(j)), \pi(\pi(\pi(j))), \ldots$ from the remaining numbers until 1 is drawn, at 
which point a cycle (of some length) is completed. 

2. From the elements left after the first cycle is completed, $E_{n,B} \setminus \{\pi(1), \ldots, 1\}$, 
follow the process in step 1 with the smallest remaining number taking the role of 
“1.” Repeat until all elements of $E_{n,B}$ are exhausted. 

When $B = 0$, $n$ such Feller draws produce a random permutation, $\pi : E_{n,0} \to 
E_{n,0}$. However, when $B > 0$, in $n$ such Feller draws, $\pi : E_{n,B} \to E_{n,B}$ is only 
injective, and there may be the possibility that no cycle of any length is completed. 
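
The two steps of the Feller algorithm translate almost directly into code. The following is a minimal sketch of $n$ Feller draws from $\{1, \ldots, n+B\}$ that records only the lengths of the cycles completed along the way; the function name, the use of Python's `random.Random`, and the choice to return cycle lengths rather than the map $\pi$ itself are illustrative assumptions.

```python
import random

def feller_cycle_lengths(n, B, rng):
    """Run n Feller draws on {1, ..., n + B}; return the lengths of completed cycles."""
    remaining = set(range(1, n + B + 1))
    start = min(remaining)            # element playing the role of "1"
    current_length = 0
    completed = []
    for _ in range(n):
        drawn = rng.choice(sorted(remaining))   # uniform draw from what is left
        remaining.remove(drawn)
        current_length += 1
        if drawn == start:            # drawing the start element closes the cycle
            completed.append(current_length)
            current_length = 0
            if remaining:
                start = min(remaining)
    return completed
```

At each draw exactly one of the remaining elements (the current start) closes the cycle, so with $B = 0$ the last draw always completes a cycle and the lengths sum to $n$, whereas with $B > 0$ the list may be empty, as noted in the text.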

Let now $\{I_i^{(n)} : 1 \le i \le n\}$ be the indicators of when a cycle is completed at 
the $i$th drawing in $n$ Feller draws from $E_{n,B}$. It is not difficult to see that $\{I_i^{(n)}\}$ 
are independent Bernoulli random variables with $P(I_i^{(n)} = 1) = 1/(n + B - i + 1)$, 
since at time $i$, independent of the past, there is exactly one choice among the 
remaining $n + B - i + 1$ members left in $E_{n,B}$ to complete the cycle (to paraphrase 
Example 5.4, Ch. 1 [Durrett (1995)]). 

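For $1 \le k \le n$, let $D_k^{(n)}$ be the number of cycles of length $k$ in the first $n$ Feller 
draws from $E_{n,B}$. It is easy to see that 

$$D_k^{(n)} \stackrel{d}{\to} Z_{k,B} \quad \text{for } k \ge 1$$

and we give a quick proof below. 

Indeed, since a cycle of length $k$ is finished on the $m$th draw, for $m \ge k + 1$, 
exactly when $I_{m-k}^{(n)} (1 - I_{m-k+1}^{(n)}) \cdots (1 - I_{m-1}^{(n)}) I_m^{(n)} = 1$, and also since the first cycle 
is a $k$-cycle exactly when $(1 - I_1^{(n)})(1 - I_2^{(n)}) \cdots (1 - I_{k-1}^{(n)}) I_k^{(n)} = 1$, we have 

$$D_k^{(n)} = (1 - I_1^{(n)})(1 - I_2^{(n)}) \cdots (1 - I_{k-1}^{(n)}) I_k^{(n)} + \sum_{i=1}^{n-k} I_i^{(n)} (1 - I_{i+1}^{(n)}) \cdots (1 - I_{i+k-1}^{(n)}) I_{i+k}^{(n)}.$$

Let $\{X_i : i \ge 1\}$ be independent Bernoulli random variables defined on a common 
space with $P(X_i = 1) = 1/(i + B)$, so that $X_i = I_{n-i+1}^{(n)}$ in law for $1 \le i \le n$. We 
can then write $D_k^{(n)}$ equivalently in distribution as 

$$D_k^{(n)} \stackrel{d}{=} \sum_{i=1}^{n-k} X_i (1 - X_{i+1}) \cdots (1 - X_{i+k-1}) X_{i+k} + X_{n-k+1} \prod_{j=n-k+2}^{n} (1 - X_j).$$

As $\lim_{n \to \infty} X_{n-k+1} (1 - X_{n-k+2}) \cdots (1 - X_n) = 0$ in probability, we have 

$$D_k^{(n)} \stackrel{d}{\to} \sum_{i \ge 1} X_i (1 - X_{i+1}) \cdots (1 - X_{i+k-1}) X_{i+k} = Z_{k,B}.$$

We see from this construction that $Z_{k,B}$ represents the asymptotic number of 
“young” or “age-dependent” $k$-cycle numbers, that is, those formed in the first $n$ 
Feller draws from sets of size $n + B$. 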

3. Preliminary lemmas 

We will use the following standard definition of the factorial power of order $r$ of an 
integer $a$: 

$$a^{[r]} = \begin{cases} a(a-1) \cdots (a - r + 1) & \text{when } a, r \ge 1 \\ 1 & \text{when } r = 0 \\ 0 & \text{when } a = 0. \end{cases}$$

Equation (2) gives a representation for the count variable $Z_k$ of $k$-strings as 
a series of dependent summands $Y_{i,k}$, defined in (1) in terms of the $B$-harmonic 
Bernoulli sequence $\{X_i, i \ge 1\}$. The summands $\{Y_{i,k}, i \ge 1\}$ are indicator variables 
with the following useful properties: 

$$Y_{i,k}^2 = Y_{i,k}, \qquad Y_{i,k} Y_{i,k'} = 0 \text{ if } k \ne k', \qquad Y_{i,k} Y_{i',k'} = 0 \text{ for } i + 1 \le i' < i + k, \tag{6}$$

$$Y_{i,k} \text{ and } Y_{i+k+j,m} \text{ are independent for } j \ge 1, \tag{7}$$

$$E(Y_{i,k}) = \frac{1}{(i+k-1+B)(i+k+B)}, \qquad E(Y_{i,k} Y_{i+k,m}) = \frac{1}{(i+k-1+B)(i+k+m-1+B)(i+k+m+B)}. \tag{8}$$

These properties allow us to give simplified expressions for products of factorial 
powers of the count vector $(Z_1, \ldots, Z_n)$ in terms of $\{Y_{i,k}\}$. 

The following lemma gives a representation for the factorial power of a sum of 
arbitrary indicator variables. 
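Lemma 1. Let $(I_1, I_2, \ldots)$ be indicator variables, and let $Z = \sum_{i \ge 1} I_i$ be their sum. 
Then for integers $r \ge 1$, the factorial powers of $Z$ have the following representation: 

$$Z^{[r]} = \sum_{\substack{i_1, \ldots, i_r \\ \text{distinct}}} I_{i_1} \cdots I_{i_r}. \tag{9}$$

Proof. The proof is by induction. For $r = 1$, the identity in (9) is obvious. Now 
assume that the same identity holds for $r - 1$, with $r \ge 2$. Write 

$$Z^{[r]} = (Z - (r-1)) \cdot Z^{[r-1]} = (Z - (r-1)) \cdot \sum_{\substack{i_1, \ldots, i_{r-1} \\ \text{distinct}}} I_{i_1} \cdots I_{i_{r-1}}.$$

Since $I_j$ is 0-1 valued, $I_j^2 = I_j$ for all $j$, and we have 

$$Z \sum_{\substack{i_1, \ldots, i_{r-1} \\ \text{distinct}}} I_{i_1} \cdots I_{i_{r-1}} = \sum_{\substack{i_1, \ldots, i_r \\ \text{distinct}}} I_{i_1} \cdots I_{i_r} + (r-1) \sum_{\substack{i_1, \ldots, i_{r-1} \\ \text{distinct}}} I_{i_1} \cdots I_{i_{r-1}}.$$

Thus 

$$Z^{[r]} = \sum_{\substack{i_1, \ldots, i_r \\ \text{distinct}}} I_{i_1} \cdots I_{i_r}.$$

This establishes the identity in (9) for $r$ and completes the proof of Lemma 1. □ 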
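
The representation of a factorial power of a sum of indicators as a sum of products over ordered tuples of distinct indices can be confirmed by brute force on short 0-1 vectors. The sketch below restricts to finitely many indicators, which is an assumption made only so that the check is finite; the function names are illustrative.

```python
from itertools import permutations, product

def factorial_power(a, r):
    """a^{[r]} = a (a - 1) ... (a - r + 1), with the empty product equal to 1."""
    result = 1
    for j in range(r):
        result *= a - j
    return result

def distinct_index_sum(indicators, r):
    """Sum of I_{i_1} ... I_{i_r} over ordered r-tuples of distinct indices."""
    total = 0
    for idx in permutations(range(len(indicators)), r):
        term = 1
        for i in idx:
            term *= indicators[i]
        total += term
    return total

def lemma1_holds(length, r):
    """Check Z^{[r]} against the distinct-index sum for every 0-1 vector of this length."""
    return all(
        factorial_power(sum(ind), r) == distinct_index_sum(ind, r)
        for ind in product((0, 1), repeat=length)
    )
```

Both sides count the ordered ways to pick $r$ distinct indices among those whose indicator equals 1, which is why the two functions agree on every input.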


Lemma 1 can be used to obtain expressions of products of factorial powers of 
count vectors in a routine way. Lemma 2 will improve on this and give an alternative 
expression for such a product, by exploiting property (6) of $\{Y_{i,k}\}$. To state this 
result we will need the following notation. 

Let $k_1, k_2, \ldots, k_n$ be distinct integers and let $r_1, r_2, \ldots, r_n$ be (not necessarily 
distinct) integers all of which are greater than or equal to 1. Let $R_0 = 0$, $R_m = \sum_{j=1}^m r_j$ 
for $m = 1, \ldots, n$ and let $A_n = \{a_l\}_{l=1}^{R_n} = \{k_1, \ldots, k_1, k_2, \ldots, k_2, \ldots, k_n, \ldots, k_n\}$, 
where $k_j$ is repeated $r_j$ times. 
Let $S_{A_n}$ be the $R_n!$ permutations of $A_n$, though there are only $\binom{R_n}{r_1, \ldots, r_n}$ distinct 
permutations. Finally, for $\pi \in S_{A_n}$, let 

$$S_m(\pi) = \sum_{j=1}^m \pi_j \quad \text{for } 1 \le m \le R_n. \tag{10}$$

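Lemma 2. For $n \ge 1$, let $k_1, \ldots, k_n \ge 1$ be distinct integers and $r_1, \ldots, r_n \ge 1$ be 
(not necessarily distinct) integers. Then, 

$$\prod_{j=1}^n Z_{k_j}^{[r_j]} = \sum_{\pi \in S_{A_n}}\ \sum_{1 \le i_1 < \cdots < i_{R_n}}\ \prod_{l=1}^{R_n} Y_{i_l, \pi_l}. \tag{11}$$

Proof. From Lemma 1 and (6), we get 

$$\prod_{j=1}^n Z_{k_j}^{[r_j]} = \prod_{j=1}^n \sum_{\substack{i_{R_{j-1}+1}, \ldots, i_{R_j} \\ \text{distinct}}}\ \prod_{l=R_{j-1}+1}^{R_j} Y_{i_l, k_j} = \sum_{\substack{i_1, \ldots, i_{R_n} \\ \text{distinct}}}\ \prod_{l=1}^{R_n} Y_{i_l, a_l} = \sum_{\pi \in S_{A_n}}\ \sum_{1 \le i_1 < \cdots < i_{R_n}}\ \prod_{l=1}^{R_n} Y_{i_l, \pi_l}.$$

This completes the proof of Lemma 2. □ 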

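For a vector of integers $\mathbf{k} = (k_1, k_2, \ldots)$ with $k_n \ge 1$ for all $n$, define $K_m = 
\sum_{j=1}^m k_j$ to be the partial sums, and $\mathbf{k}(r,s) = (k_r, k_{r+1}, \ldots, k_s)$ to be the segment from 
$r$ to $s$. For $1 \le m \le n$ and $r \ge 1$, define 

$$C(r; \mathbf{k}(m,n)) = \sum_{r \le i_m < i_{m+1} < \cdots < i_n}\ \prod_{l=m}^n Y_{i_l, k_l}.$$

The following is a key lemma which gives two identities useful for the calculation 
of factorial moments of the count vector $(Z_{1,B}, \ldots, Z_{n,B})$. 

Lemma 3. For integers $r, n \ge 1$ and vectors $\mathbf{k}$ the following two identities hold: 

$$E[Y_{r,k_1} C(r+1; \mathbf{k}(2, n+1))] = \prod_{m=1}^{n+1} \frac{1}{r + K_m - 1 + B} - \prod_{m=1}^{n+1} \frac{1}{r + K_m + B}, \tag{12}$$

and 

$$E[C(r; \mathbf{k}(1,n))] = \prod_{m=1}^{n} \frac{1}{r + K_m - 1 + B}. \tag{13}$$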

Proof. The proof is by simultaneous induction for both (12) and (13) on $n$, the
number of $Y_{i,k}$ factors in $C(r; k(l,m))$ where $m - l + 1 = n$. Throughout, we will
rely heavily on the properties (6), (7) and (8) of $\{Y_{i,k}\}$.

We will now establish (12) for $n = 1$. Notice that

$$\begin{aligned}
E[Y_{r,k_1} C(r+1; k(2,2))]
&= \sum_{i \ge r+1} E[Y_{r,k_1} Y_{i,k_2}] = \sum_{i \ge r+k_1} E[Y_{r,k_1} Y_{i,k_2}] \\
&= E[Y_{r,k_1} Y_{r+k_1,k_2}] + \sum_{i \ge r+k_1+1} E[Y_{r,k_1}]\, E[Y_{i,k_2}] \\
&= \frac{1}{(r+k_1-1+B)(r+K_2-1+B)(r+K_2+B)} \\
&\quad + \frac{1}{(r+k_1-1+B)(r+k_1+B)} \sum_{i \ge r+k_1+1} \frac{1}{(i+k_2-1+B)(i+k_2+B)} \\
&= \frac{1}{(r+k_1-1+B)(r+K_2-1+B)(r+K_2+B)} + \frac{1}{(r+k_1-1+B)(r+k_1+B)(r+K_2+B)} \\
&= \frac{1}{(r-1+k_1+B)(r-1+K_2+B)} - \frac{1}{(r+k_1+B)(r+K_2+B)}.
\end{aligned}$$

This establishes (12) for $n = 1$.

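Next,

$$E[C(r; k(1,1))] = \sum_{i \ge r} E[Y_{i,k_1}] = \sum_{i \ge r} \frac{1}{(i+k_1-1+B)(i+k_1+B)}
= \sum_{i \ge r} \left[ \frac{1}{i+k_1-1+B} - \frac{1}{i+k_1+B} \right] = \frac{1}{r+k_1-1+B},$$

which establishes (13) for $n = 1$.

For the induction step, let $N \ge 2$ and assume that (12) and (13) hold for
$n = N - 1$. We first establish (13) for $n = N$ by using the validity of (12) for
$n = N - 1$ as follows:

$$\begin{aligned}
E[C(r; k(1,N))]
&= \sum_{i \ge r} E[Y_{i,k_1} C(i+1; k(2,N))] \\
&= \sum_{i \ge r} \left[ \prod_{m=1}^{N} \frac{1}{i + K_m - 1 + B} - \prod_{m=1}^{N} \frac{1}{i + K_m + B} \right] \\
&= \prod_{m=1}^{N} \frac{1}{r + K_m - 1 + B},
\end{aligned}$$

since the sum telescopes.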


To finish the induction we now proceed to establish (12) for $n = N$, assuming
that (12) holds for $n = N - 1$ and (13) holds for $n = N$. Notice that

$$\begin{aligned}
E[Y_{r,k_1} C(r+1; k(2,N+1))] &= E[Y_{r,k_1} C(r+k_1; k(2,N+1))] \\
&= E[Y_{r,k_1} Y_{r+k_1,k_2} C(r+K_2; k(3,N+1))] \\
&\quad + E[Y_{r,k_1}]\, E[C(r+k_1+1; k(2,N+1))].
\end{aligned}$$

By conditioning on $X_{r+k_1}$ and noting that many terms vanish when $X_{r+k_1} = 0$,
the first term above simplifies as follows:

$$\begin{aligned}
E[Y_{r,k_1} Y_{r+k_1,k_2} C(r+K_2; k(3,N+1))]
&= E\big[Y_{r,k_1}\, E[Y_{r+k_1,k_2} C(r+K_2; k(3,N+1)) \mid X_r, \ldots, X_{r+k_1}]\big] \\
&= E\big[Y_{r,k_1}\, E[Y_{r+k_1,k_2} C(r+K_2; k(3,N+1)) \mid X_{r+k_1}]\big] \\
&= E\big[E[Y_{r,k_1} \mid X_{r+k_1}]\, E[Y_{r+k_1,k_2} C(r+K_2; k(3,N+1)) \mid X_{r+k_1}]\big] \\
&= E[Y_{r,k_1} \mid X_{r+k_1} = 1] \\
&\quad \cdot E[Y_{r+k_1,k_2} C(r+K_2; k(3,N+1)) \mid X_{r+k_1} = 1]\, P(X_{r+k_1} = 1) \\
&= E[Y_{r,k_1} \mid X_{r+k_1} = 1]\, E[Y_{r+k_1,k_2} C(r+K_2; k(3,N+1))].
\end{aligned}$$

The assumption that (12) and (13) hold for $n = N - 1$ yields

$$\begin{aligned}
E[Y_{r,k_1} C(r+1; k(2,N+1))]
&= E[Y_{r,k_1} \mid X_{r+k_1} = 1]\, E[Y_{r+k_1,k_2} C(r+K_2; k(3,N+1))] \\
&\quad + E[Y_{r,k_1}]\, E[C(r+k_1+1; k(2,N+1))] \\
&= \frac{1}{r+k_1-1+B} \left[ \prod_{m=2}^{N+1} \frac{1}{r+K_m-1+B} - \prod_{m=2}^{N+1} \frac{1}{r+K_m+B} \right] \\
&\quad + \frac{1}{(r+k_1-1+B)(r+k_1+B)} \prod_{m=2}^{N+1} \frac{1}{r+K_m+B} \\
&= \prod_{m=1}^{N+1} \frac{1}{r+K_m-1+B}
 - \frac{1}{r+k_1-1+B} \left[ 1 - \frac{1}{r+k_1+B} \right] \prod_{m=2}^{N+1} \frac{1}{r+K_m+B} \\
&= \prod_{m=1}^{N+1} \frac{1}{r+K_m-1+B} - \prod_{m=1}^{N+1} \frac{1}{r+K_m+B}.
\end{aligned}$$

This establishes (12) for $n = N$ and completes the proof of the lemma. □
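The base case of the induction can be checked in exact rational arithmetic. The sketch below is illustrative only: it assumes (from the earlier part of the paper, not restated in this section) that a $B$-harmonic sequence has independent coordinates with $P(X_i = 1) = 1/(i+B)$ and that $Y_{r,k} = X_r (1 - X_{r+1}) \cdots (1 - X_{r+k-1}) X_{r+k}$; the function names are hypothetical.

```python
from fractions import Fraction

def p_one(i, B):
    # Assumed success probability of X_i in a B-harmonic sequence.
    return Fraction(1) / (i + B)

def mean_Y(r, k, B):
    # E[Y_{r,k}] under the assumed definition
    # Y_{r,k} = X_r (1 - X_{r+1}) ... (1 - X_{r+k-1}) X_{r+k}.
    value = p_one(r, B) * p_one(r + k, B)
    for j in range(r + 1, r + k):
        value *= 1 - p_one(j, B)
    return value

def rhs_12(r, ks, B):
    # Right-hand side of (12) for the vector ks = (k_1, ..., k_{n+1}).
    first, second, K = Fraction(1), Fraction(1), 0
    for k in ks:
        K += k
        first /= (r + K - 1 + B)
        second /= (r + K + B)
    return first - second

def lhs_12_two_factors(r, k1, k2, B):
    # n = 1 case, following the two terms of the proof.
    # Term 1: E[Y_{r,k1} Y_{r+k1,k2}], a product over independent coordinates.
    joint = p_one(r, B) * p_one(r + k1, B) * p_one(r + k1 + k2, B)
    zeros = list(range(r + 1, r + k1)) + list(range(r + k1 + 1, r + k1 + k2))
    for j in zeros:
        joint *= 1 - p_one(j, B)
    # Term 2: E[Y_{r,k1}] times the telescoped tail sum 1 / (r + K_2 + B).
    tail = mean_Y(r, k1, B) / (r + k1 + k2 + B)
    return joint + tail
```

For instance, `lhs_12_two_factors(2, 3, 1, 1)` and `rhs_12(2, (3, 1), 1)` return the same fraction.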


4. Main results and corollaries 

Consider a $B$-harmonic Bernoulli sequence and the corresponding count vector $Z_B$.
For non-negative integers $s_1, s_2, \ldots, s_m$ define

$$\mu_B(s_1, \ldots, s_m) = E\big(Z_{1,B}^{[s_1]} Z_{2,B}^{[s_2]} \cdots Z_{m,B}^{[s_m]}\big).$$

The following theorem gives an explicit form for the factorial moments of this count
vector which will be used to identify its joint distribution.

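Theorem 1. Let $Z_B$ be the count vector arising from a $B$-harmonic Bernoulli
sequence $\{X_i\}$. Let $k_1, \ldots, k_n$ be distinct integers and let $r_1, \ldots, r_n$ be not necessarily
distinct integers, all greater than or equal to 1. Recall the notations $R_m$, $A_n$, $S_{A_n}$
and $S_m(\pi)$ from just before (10). Then

$$E\Big(\prod_{j=1}^{n} Z_{k_j,B}^{[r_j]}\Big) = \sum_{\pi \in S_{A_n}} \prod_{m=1}^{R_n} \frac{1}{S_m(\pi) + B}. \tag{14}$$

Proof. From Lemmas 2 and 3, using the notation in (10),

$$E\Big(\prod_{j=1}^{n} Z_{k_j,B}^{[r_j]}\Big)
= \sum_{\pi \in S_{A_n}} E\Big[ \sum_{1 \le i_1 < \cdots < i_{R_n}} \prod_{m=1}^{R_n} Y_{i_m,\pi_m} \Big]
= \sum_{\pi \in S_{A_n}} E[C(1; \pi(1, R_n))]
= \sum_{\pi \in S_{A_n}} \prod_{m=1}^{R_n} \frac{1}{S_m(\pi) + B}.$$

This completes the proof of the theorem. □

The next theorem, which is the main result of this paper, gives the factorial
moments of $(Z_{1,B}, \ldots, Z_{N,B})$ for $B$-harmonic Bernoulli sequences and deduces the
structure of the joint distribution of $Z_B$.

Theorem 2. For non-negative integers $s_1, \ldots, s_N$,

$$\mu_B(s_1, \ldots, s_N) = E\Big(\prod_{j=1}^{N} Z_{j,B}^{[s_j]}\Big) = \int_0^1 \prod_{j=1}^{N} \Big(\frac{1 - v^j}{j}\Big)^{s_j} B v^{B-1}\, dv. \tag{15}$$

This implies that the joint distribution $P_B$ of $Z_B$ has the following structure: there
is a random variable $V$ and the joint distribution $Q_B$ of $(V, Z_B)$ can be described as
follows: $V$ has a Beta$(B, 1)$ distribution (which is the point mass at 0 when $B = 0$)
and given $V = v$, the conditional distribution $P_{v,B}$ of $(Z_{1,B}, Z_{2,B}, \ldots)$ is that of
independent Poissons with intensities $1 - v, \frac{1 - v^2}{2}, \frac{1 - v^3}{3}, \ldots$ respectively.

Proof. First, let $B > 0$ as the case $B = 0$ is analogous or can be obtained by
taking the limit $B \downarrow 0$. Second, to establish (15), we can assume that $s_m > 0$
for some $m$. In fact, let $(s_{k_1}, \ldots, s_{k_n})$ be the vector formed from the non-zeros in
$(s_1, s_2, \ldots, s_N)$, write $r_j = s_{k_j}$, and let $R_n$, $A_n$, $S_{A_n}$ and $S_m(\pi)$ for $\pi \in S_{A_n}$ be as
defined near (10). Let also $W_0, W_1, W_2, \ldots, W_{R_n}$ be independent exponential r.v.'s with
failure rates $B, \lambda_1, \ldots, \lambda_{R_n} = B, k_1, \ldots, k_1, \ldots, k_n, \ldots, k_n$, respectively. Then, for
any $\pi \in S_{A_n}$,

$$\prod_{m=1}^{R_n} \frac{\pi_m}{S_m(\pi) + B} = P(W_{\pi_{R_n}} < \cdots < W_{\pi_1} < W_0), \tag{16}$$

where $W_{\pi_m}$ denotes the exponential variable whose failure rate is the $m$th entry of $\pi$.

From Theorem 1 and (16), we conclude

$$\begin{aligned}
\prod_{j=1}^{n} k_j^{r_j} \cdot \mu_B(s_1, \ldots, s_N)
&= \prod_{j=1}^{n} k_j^{r_j} \cdot E\Big(\prod_{j=1}^{n} Z_{k_j,B}^{[r_j]}\Big)
 = \sum_{\pi \in S_{A_n}} \prod_{m=1}^{R_n} \frac{\pi_m}{S_m(\pi) + B} \\
&= \sum_{\pi \in S_{A_n}} P(W_{\pi_{R_n}} < \cdots < W_{\pi_1} < W_0) \\
&= P(\max(W_1, \ldots, W_{R_n}) < W_0) \\
&= \int_0^{\infty} B e^{-Bw} \prod_{j=1}^{n} (1 - e^{-k_j w})^{r_j}\, dw
 = \int_0^1 \prod_{j=1}^{N} (1 - v^j)^{s_j}\, B v^{B-1}\, dv,
\end{aligned}$$

so that

$$\mu_B(s_1, \ldots, s_N) = \int_0^1 \prod_{j=1}^{N} \Big(\frac{1 - v^j}{j}\Big)^{s_j} B v^{B-1}\, dv
= \int_0^1 E\Big(\prod_{j=1}^{N} Z_{j,v}^{[s_j]}\Big) B v^{B-1}\, dv,$$

where, for each $v$, $Z_{1,v}, Z_{2,v}, \ldots$ are independent Poisson random variables with
means $(1 - v), (1 - v^2)/2, \ldots$ respectively. This establishes the structure of $P_B$ as
desired. □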
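As a sanity check on how Theorems 1 and 2 fit together, the permutation sum in (14) can be compared with the Beta-Poisson mixture integral in exact arithmetic. This is an illustrative sketch with hypothetical function names, assuming $B > 0$ and small multisets (the permutation sum has $R_n!$ terms); it is not the authors' computation.

```python
from fractions import Fraction
from itertools import permutations

def perm_sum(ks, rs, B):
    # Right-hand side of (14): sum over all R_n! permutations of the
    # multiset in which k_j appears r_j times.
    multiset = [k for k, r in zip(ks, rs) for _ in range(r)]
    total = Fraction(0)
    for pi in permutations(multiset):   # repeated entries count as distinct
        term, partial = Fraction(1), 0
        for entry in pi:
            partial += entry            # S_m(pi)
            term /= (partial + B)
        total += term
    return total

def mixture_moment(ks, rs, B):
    # integral_0^1 prod_j ((1 - v^{k_j}) / k_j)^{r_j} B v^{B-1} dv for B > 0,
    # obtained by expanding the product into a polynomial in v and using
    # integral_0^1 v^e B v^{B-1} dv = B / (e + B).
    poly = {0: Fraction(1)}             # exponent -> coefficient
    for k, r in zip(ks, rs):
        for _ in range(r):
            nxt = {}
            for e, c in poly.items():
                nxt[e] = nxt.get(e, 0) + c / k
                nxt[e + k] = nxt.get(e + k, 0) - c / k
            poly = nxt
    return sum(c * Fraction(B) / (e + B) for e, c in poly.items())
```

For example, `perm_sum((1, 3), (2, 1), 2)` and `mixture_moment((1, 3), (2, 1), 2)` agree exactly.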


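Remark 2. We now indicate an alternate argument to obtain Theorem 2. Writing
$\mu_B(r_1, \ldots, r_n) = E\big(\prod_{j=1}^{n} Z_{j,B}^{[r_j]}\big)$, consider
the factorial moment generating function

$$\phi_B(t_1, \ldots, t_n) = \sum_{r_1, \ldots, r_n \ge 0} \mu_B(r_1, \ldots, r_n)\, \frac{t_1^{r_1} \cdots t_n^{r_n}}{r_1! \cdots r_n!}.$$

The denominator of the last factor in (14), $S_{R_n}(\pi) + B$, is the same for all values
of $\pi$ and equals $\sum_{j=1}^{n} j r_j + B$. Hence, we have the recurrence relation

$$\Big(\sum_{j=1}^{n} j r_j + B\Big)\, \mu_B(r_1, \ldots, r_n) = \sum_{j=1}^{n} r_j\, \mu_B(r_1, \ldots, r_j - 1, \ldots, r_n),$$

which in turn leads to the partial differential equation

$$\sum_{j=1}^{n} j t_j \frac{\partial \phi_B}{\partial t_j} = \Big(\sum_{j=1}^{n} t_j - B\Big) \phi_B + B. \tag{17}$$

Also, the marginal factorial moment generating function $\phi_{j,B}(t_j)$ of $Z_{j,B}$ satisfies
$j t_j\, d\phi_{j,B}(t_j)/dt_j = (t_j - B)\phi_{j,B}(t_j) + B$ with the boundary condition $\phi_{j,B}(0) = 1$.
Its unique solution is $\phi_{j,B}(t_j) = \int_0^1 \exp\{t_j (1 - v^j)/j\}\, B v^{B-1}\, dv$. Then, the
boundary conditions for the equation in (17) are $\phi_B(0, \ldots, 0, t_j, 0, \ldots, 0) = \phi_{j,B}(t_j)$
for $1 \le j \le n$. It can be checked that equation (17) has a unique solution, namely

$$\phi_B(t_1, \ldots, t_n) = \int_0^1 B v^{B-1} \exp\Big\{ \sum_{j=1}^{n} \frac{t_j}{j} (1 - v^j) \Big\}\, dv,$$

which immediately gives the description of the joint distribution of $Z_B$ in Theorem 2.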
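The marginal differential equation in the remark is equivalent to a one-step recurrence on power-series coefficients: if $\phi_{j,B}(t) = \sum_r m_r t^r / r!$, then $j t \phi' = (t - B)\phi + B$ holds exactly when $(jr + B)\, m_r = r\, m_{r-1}$ for $r \ge 1$. The sketch below checks this recurrence for the single-coordinate case of (14), $m_r = r! / \prod_{i=1}^{r} (ij + B)$. Function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def marginal_moment(j, r, B):
    # r-th factorial moment of Z_{j,B}: r! / ((j+B)(2j+B)...(rj+B)),
    # i.e. (14) with a single k = j repeated r times.
    value = Fraction(factorial(r))
    for i in range(1, r + 1):
        value /= (i * j + B)
    return value

def recurrence_gap(j, r, B):
    # Zero exactly when (j r + B) m_r = r m_{r-1}.
    return (j * r + B) * marginal_moment(j, r, B) - r * marginal_moment(j, r - 1, B)
```

Every call such as `recurrence_gap(3, 4, Fraction(5, 2))` should return zero.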


We now give some corollaries of the main theorems. The first gives marginal
factorial moments of the count $Z_{k,B}$.

Corollary 1. For a $B$-harmonic Bernoulli sequence,

$$E\big(Z_{k,B}^{[r]}\big) = \frac{r!}{(k + B)(2k + B) \cdots (rk + B)}.$$

Proof. From Theorem 2,

$$E\big(Z_{k,B}^{[r]}\big) = \int_0^1 \Big(\frac{1 - v^k}{k}\Big)^{r} B v^{B-1}\, dv = \frac{r!}{(k + B)(2k + B) \cdots (rk + B)}.$$

The second corollary computes the covariance between $Z_{k_1,B}$ and $Z_{k_2,B}$.

Corollary 2.

$$\mathrm{cov}(Z_{k_1,B}, Z_{k_2,B}) = \frac{B}{(k_1 + B)(k_2 + B)(k_1 + k_2 + B)}.$$

Proof. From (14) in Theorem 1, we have

$$E(Z_{k_1,B})\, E(Z_{k_2,B}) = \frac{1}{(k_1 + B)(k_2 + B)},$$

$$\begin{aligned}
E(Z_{k_1,B} Z_{k_2,B}) &= \frac{1}{(k_1 + B)(k_1 + k_2 + B)} + \frac{1}{(k_2 + B)(k_1 + k_2 + B)} \\
&= \frac{1}{(k_1 + B)(k_2 + B)} + \frac{B}{(k_1 + B)(k_2 + B)(k_1 + k_2 + B)}.
\end{aligned}$$

This shows that $Z_{k_1,B}$ and $Z_{k_2,B}$ are positively correlated and

$$\mathrm{cov}(Z_{k_1,B}, Z_{k_2,B}) = \frac{B}{(k_1 + B)(k_2 + B)(k_1 + k_2 + B)}.$$
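The covariance can also be read off the mixture structure of Theorem 2 rather than from (14). Given $V = v$ the counts are independent, so only the conditional means $(1 - v^k)/k$ contribute, and $E[V^a] = B/(a + B)$ for $V \sim$ Beta$(B, 1)$. The following is a minimal sketch of that route with hypothetical function names, assuming $B > 0$.

```python
from fractions import Fraction

def beta_power_mean(a, B):
    # E[V^a] for V ~ Beta(B, 1) with B > 0.
    return Fraction(B) / (a + B)

def cov_from_mixture(k1, k2, B):
    # Covariance of the conditional means (1 - V^k1)/k1 and (1 - V^k2)/k2;
    # the conditional covariance given V is zero by independence.
    e1 = (1 - beta_power_mean(k1, B)) / k1
    e2 = (1 - beta_power_mean(k2, B)) / k2
    e12 = (1 - beta_power_mean(k1, B) - beta_power_mean(k2, B)
           + beta_power_mean(k1 + k2, B)) / (k1 * k2)
    return e12 - e1 * e2

def cov_closed_form(k1, k2, B):
    # The expression stated in Corollary 2.
    return Fraction(B) / ((k1 + B) * (k2 + B) * (k1 + k2 + B))
```

The two functions agree for distinct `k1`, `k2`; for example both give 1/24 at `(1, 2, 1)`.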


The FKG or positive association property of $P_B$ is now established.

Theorem 3. The joint distribution $P_B$ of $Z$ possesses the FKG property.

Proof. Let $f, g$ be bounded functions on the state space of $Z$ which are coordinate-wise
increasing and are supported on a finite number of coordinates. We need to show that

$$\int f(Z) g(Z)\, dP_B \ge \int f(Z)\, dP_B \int g(Z)\, dP_B. \tag{18}$$

It is well known that distributions on the real line and products of measures on the
real line possess the FKG property [Liggett (1985)]. Since the Poisson distribution
is stochastically increasing in its intensity parameter, the product measure $P_{v,B}$
(cf. Theorem 2) is stochastically decreasing in $v$. This means that for any bounded
increasing function $f$, $\int f(z)\, dP_{v,B}$ is decreasing in $v$. Thus

$$\begin{aligned}
\int f(Z) g(Z)\, dP_B
&= \int_0^1 B v^{B-1} \int f(Z) g(Z)\, dP_{v,B}\, dv \\
&\ge \int_0^1 B v^{B-1} \int f(Z)\, dP_{v,B} \int g(Z)\, dP_{v,B}\, dv
 \quad \text{since } P_{v,B} \text{ is a product measure} \\
&\ge \int_0^1 B v^{B-1} \int f(Z)\, dP_{v,B}\, dv \cdot \int_0^1 B v^{B-1} \int g(Z)\, dP_{v,B}\, dv \\
&\qquad \text{since } \int f(Z)\, dP_{v,B} \text{ and } \int g(Z)\, dP_{v,B} \text{ both decrease in } v \\
&= E_B(f(Z))\, E_B(g(Z)).
\end{aligned}$$

This completes the proof of this theorem. □

Finally, in the introduction, we stated that the parameter $B$ cannot be estimated
from $Z$. This is a consequence of the fact below.

Theorem 4. Let $M_B$ be the joint distribution of the $B$-harmonic Bernoulli
sequence $\{X_i\}$. Then for $0 < B < B'$, the measures $M_B$ and $M_{B'}$ are absolutely
continuous with respect to one another.

Proof. Since $M_B$, $M_{B'}$ are product measures, we compute the Kakutani dichotomy
criterion


Thus for $B \ne B'$, $M_B \ll M_{B'}$. This also implies that $P_B = M_B Z^{-1} \ll P_{B'} =$

Chebyshev polynomials and G-distributed
functions of F-distributed variables

Anirban DasGupta¹ and L. Shepp²

Purdue University and Rutgers University

Abstract: We address a more general version of a classic question in probability
theory. Suppose $X \sim N_p(\mu, \Sigma)$. What functions of $X$ also have the
$N_p(\mu, \Sigma)$ distribution? For $p = 1$, we give a general result on functions that
cannot have this special property. On the other hand, for the $p = 2, 3$ cases,
we give a family of new nonlinear and non-analytic functions with this property
by using the Chebyshev polynomials of the first, second and the third
kind. As a consequence, a family of rational functions of a Cauchy-distributed
variable are seen to be also Cauchy distributed. Also, with three i.i.d. $N(0,1)$
variables, we provide a family of functions of them each of which is distributed
as the symmetric stable law with exponent $\frac{1}{2}$. The article starts with a result
with astronomical origin on the reciprocal of the square root of an infinite
sum of nonlinear functions of normal variables being also normally distributed;
this result, aside from its astronomical interest, illustrates the complexity of
functions of normal variables that can also be normally distributed.

1. Introduction 

It is a pleasure for both of us to be writing to honor Herman. We have known and
admired Herman for as long as we can remember. This particular topic is close to
Herman's heart; he has given us many cute facts over the years. Here are some to
him in reciprocation.

Suppose a real random variable $X \sim N(\mu, \sigma^2)$. What functions of $X$ are also
normally distributed? In the one dimensional case, an analytic map other than the
linear ones cannot also be normally distributed; in higher dimensions, this is not
true. Also, it is not possible for any one-to-one map other than the linear ones to
be normally distributed. Textbook examples show that in the one dimensional case
nonlinear functions $U(X)$, not analytic or one-to-one, can be normally distributed
if $X$ is normally distributed; for example, if $Z \sim N(0,1)$ and $\Phi$ denotes the $N(0,1)$
CDF, then, trivially, $U(Z) = \Phi^{-1}(2\Phi(|Z|) - 1)$ is also distributed as $N(0,1)$. Note
that this function $U(\cdot)$ is not one-to-one; neither is it analytic.

One of the present authors pointed out the interesting fact that if $X, Y$ are i.i.d.
$N(0,1)$, then the nonlinear functions $U(X,Y) = \frac{2XY}{\sqrt{X^2 + Y^2}}$ and
$V(X,Y) = \frac{X^2 - Y^2}{\sqrt{X^2 + Y^2}}$
are also i.i.d. $N(0,1)$-distributed (see Shepp (1962), Feller (1966)). These are
obviously nonlinear and not one-to-one functions of $X, Y$. We present a collection
of new pairs of functions $U(X,Y), V(X,Y)$ that are i.i.d. $N(0,1)$-distributed. The
functions $U(X,Y), V(X,Y)$ are constructed by using the sequence of Chebyshev
polynomials of the first, second and the third kind, and the terrain corresponding
to the plots of $U(X,Y), V(X,Y)$ gets increasingly more rugged, and yet with a
visual regularity, as one progresses up the hierarchy. Certain other results about
Cauchy-distributed functions of a Cauchy-distributed variable and solutions of
certain Fredholm integral equations follow as corollaries to these functions $U, V$ being
i.i.d. $N(0,1)$ distributed, which we point out briefly as a matter of some additional
potential interest. Using the family of functions $U(X,Y), V(X,Y)$, we also
provide a family of functions $f(X,Y,Z), g(X,Y,Z), h(X,Y,Z)$ such that $f, g, h$ are
i.i.d. $N(0,1)$ if $X, Y, Z$ are i.i.d. $N(0,1)$. The article ends with a family of functions
of three i.i.d. $N(0,1)$ variables, each distributed as a symmetric stable law with
exponent $\frac{1}{2}$; the construction uses the Chebyshev polynomials once again.

We start with an interesting example with astronomical origin of the reciprocal
of the square root of an infinite sum of dependent nonlinear functions of normally
distributed variables being distributed as a normal again. The result also is relevant
in the study of total signal received at a telephone base station when a fraction of the
signal emitted by each wireless telephone gets lost due to various interferences. See
Heath and Shepp (2003) for description of both the astronomical and the telephone
signal problem. Besides the quite curious fact that it should be normally distributed
at all, this result illustrates the complexity of functions of normal variables that
can also be normally distributed.

¹Department of Statistics, Purdue University, 150 N. University Street, West Lafayette, IN
47907-2068, USA. e-mail: dasgupta@stat.purdue.edu

²Department of Statistics, Rutgers University, Piscataway, NJ 08854-8019, USA. e-mail:
shepp@stat.rutgers.edu

Keywords and phrases: analytic, Cauchy, Chebyshev polynomials, normal, one-to-one, three
term recursion, stable law.


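2. Normal function of an infinite i.i.d. N(0, 1) sequence: An astronomy
example

Proposition 1. Suppose $\eta_0, \eta_1, \eta_2, \ldots$ is a sequence of i.i.d. $N(0,1)$ variables. We
show the following remarkable fact: let $S_n = \sum_{i=1}^{2n} \eta_i^2$. Then

$$N = \operatorname{sgn}(\eta_0)\, \sqrt{\frac{\pi}{8}}\, \Big( \sum_{n=1}^{\infty} S_n^{-2} \Big)^{-1/2} \sim N(0,1).$$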

The problem has an astronomical origin. Consider a fixed plane and suppose
stars are distributed in the plane according to a homogeneous Poisson process with
intensity $\lambda$; assume $\lambda$ to be 1 for convenience. Suppose now that each star emits
a constant amount of radiation, say a unit amount, and that an amount inversely
proportional to some power $k$ of the star's distance from a fixed point (say the
origin) reaches the point. If $k = 4$, then the total amount of light reaching the origin
would equal $L = \sum_{n=1}^{\infty} R_n^{-4} = \pi^2 \sum_{n=1}^{\infty} (\gamma_1 + \gamma_2 + \cdots + \gamma_n)^{-2}$,
where the $\gamma_i$ are i.i.d. standard exponentials, because if the ordered distances of the
stars from the origin are denoted by $R_1 < R_2 < R_3 < \ldots$, then
$R_n^2 \sim \frac{1}{\pi}(\gamma_1 + \gamma_2 + \cdots + \gamma_n)$, where the $\gamma_i$ are i.i.d.
standard exponentials. Since the sum of squares of two i.i.d. standard normals is
an exponential with mean 2, it follows that $L$ has the same distribution as $\frac{\pi^3}{2N^2}$,
where $N$ is as in Proposition 1 above. In particular, $L$ does not have a finite mean.
Earlier contributions to this problem are due to Chandrasekhar, Cox, and others;
for detailed references, see Heath and Shepp (2003).

To prove the Proposition, we will show the following two facts:

(a) The Laplace transform of $L$ equals $E e^{-\lambda L} = e^{-\pi^{3/2} \sqrt{\lambda}}$.

(b) If $\eta \sim N(0,1)$, then the Laplace transform of $\eta^{-2}$ equals $e^{-\sqrt{2\lambda}}$.

To prove (a), consider the more general Laplace transform of the sum of the
fourth powers of the reciprocals of $R \in S$ only for $0 \le a < R < b$, where $a, b$ are
fixed, $\phi(\lambda, a, b) = E \exp\big(-\lambda \sum_{R \in S,\, a < R < b} R^{-4}\big)$. We want
$\phi(\lambda, 0, \infty)$, but we can write the "recurrence" relation:

$$\phi(\lambda, a, b) = e^{-\pi(b^2 - a^2)} + \int_a^b e^{-\pi(r^2 - a^2)}\, \phi(\lambda, r, b)\, e^{-\lambda r^{-4}}\, 2\pi r\, dr,$$

where the first term considers the possibility that there are no points of $S$ in the
annulus $a < r < b$ and the integral is written by summing over the location of
the point in the annulus with the smallest value of $R = r$ and then using the
independence properties of the Poisson random set.

Now multiply both sides by $e^{-\pi a^2}$ and differentiate on $a$, regarding both $b$ and $\lambda$
as fixed constants, to get

$$\big(-2\pi a\, \phi(\lambda, a, b) + \phi'(\lambda, a, b)\big) e^{-\pi a^2} = -2\pi a\, e^{-\pi a^2} \phi(\lambda, a, b)\, e^{-\lambda a^{-4}}.$$

Dividing by $e^{-\pi a^2}$ and solving the simple differential equation for $\phi(\lambda, a, b)$, we get

$$\phi(\lambda, a, b) = \phi(\lambda, b, b)\, \exp\Big( -2\pi \int_a^b r \big(1 - e^{-\lambda r^{-4}}\big)\, dr \Big).$$

Since $\phi(\lambda, b, b) = 1$, we find that

$$\phi(\lambda, a, b) = \exp\Big( -2\pi \int_a^b r \big(1 - e^{-\lambda r^{-4}}\big)\, dr \Big).$$

Finally let $b \to \infty$ to obtain $\phi(\lambda, 0, \infty)$ as was desired. Evaluating the integral by
changing $u = r^{-4}$ and integration by parts, gives the answer stated in (a).

(b) can be proved by direct calculation, but a better way to see this is to use
the fact that the hitting time, $\tau_1$, of level one by a standard Brownian motion,
$W(t), t \ge 0$, has the same distribution as $\eta^{-2}$: using the reflection principle,

$$P(\tau_1 < t) = P(\max\{W(u), u \in [0, t]\} > 1) = 2P(W(t) > 1) = P(\sqrt{t}\, |\eta| > 1)
= P(\eta^{-2} < t).$$

Finally, Wald's identity

$$E e^{\lambda W(\tau_1) - \lambda^2 \tau_1 / 2} = 1$$

and the fact that $W(\tau_1) = 1$ gives the Laplace transform of $\tau_1$ and hence also of $\eta^{-2}$,
on replacing $\lambda$ by $\sqrt{2\lambda}$, as

$$E e^{-\lambda \tau_1} = E e^{-\lambda \eta^{-2}} = e^{-\sqrt{2\lambda}}.$$

This completes the proof of Proposition 1 and illustrates the complexity of functions
of normal variables that can also be normally distributed.
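Fact (b) is easy to check numerically. The sketch below approximates $E e^{-\lambda / \eta^2} = 2 \int_0^\infty \varphi(x)\, e^{-\lambda / x^2}\, dx$ with a plain midpoint rule, where $\varphi$ is the standard normal density; the truncation point and the number of steps are arbitrary choices made for illustration.

```python
import math

def laplace_inverse_square(lam, upper=10.0, steps=20000):
    # Midpoint-rule approximation of E[exp(-lam / eta^2)], eta ~ N(0, 1).
    # The integrand is even in x, so integrate over (0, upper) and double.
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += math.exp(-0.5 * x * x - lam / (x * x))
    return 2.0 * total * h / math.sqrt(2.0 * math.pi)
```

The value of `laplace_inverse_square(lam)` can then be compared with `math.exp(-math.sqrt(2 * lam))`.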

3. Chebyshev polynomials and normal functions 
3.1. A general result 

First we give a general result on large classes of functions of a random variable
$Z$ that cannot have the same distribution as that of $Z$. The result is much more
general than the special case of $Z$ being normal.

Proposition 2. Let $Z$ have a density that is symmetric, bounded, continuous,
and everywhere strictly positive. If $f(Z) \ne \pm Z$ is either one-to-one, or has a zero
derivative at some point and has a uniformly bounded derivative of some order
$r \ge 2$, then $f(Z)$ cannot have the same distribution as $Z$.

Proof. It is obvious that if $f(z)$ is one-to-one then $Z$ and $f(Z)$ cannot have the same
distribution under the stated conditions on the density of $Z$, unless $f(z) = \pm z$.
Consider now the case that $f(z)$ has a zero derivative at some point; let us take
this point to be 0 for notational convenience. Let us also suppose that $|f^{(r)}(z)| \le K$
for all $z$, for some $K < \infty$. Suppose such a function $f(Z)$ has the same distribution
as $Z$.

Denote $f(0) = a$; then $P(|f(Z) - a| \le \epsilon) = P(|Z - a| \le \epsilon) \le c_1 \epsilon$ for some
$c_1 < \infty$, because of the boundedness assumption on the density of $Z$.

On the other hand, by a Taylor expansion around 0,
$f(z) = a + \frac{z^2}{2} f''(0) + \cdots + \frac{z^r}{r!} f^{(r)}(z^*)$, at some point $z^*$ between 0 and $z$.
By the uniform boundedness condition on $f^{(r)}(z)$, from here, one has
$P(|f(Z) - a| \le \epsilon) \ge P(a_2 |Z|^2 + a_3 |Z|^3 + \cdots + a_r |Z|^r \le \epsilon)$,
for some fixed positive constants $a_2, a_3, \ldots, a_r$. For sufficiently small $\epsilon > 0$, this
implies that $P(|f(Z) - a| \le \epsilon) \ge P(M |Z|^2 \le \epsilon)$, for a suitable positive constant
$M$.

However, $P(M |Z|^2 \le \epsilon) \ge c_2 \sqrt{\epsilon}$ for some $0 < c_2 < \infty$, due to the strict
positivity and continuity of the density of $Z$. This will contradict the first bound
$P(|f(Z) - a| \le \epsilon) \le c_1 \epsilon$ for small $\epsilon$, hence completing the proof. □

3.2. Normal functions of two i.i.d. N(0, 1) variables

Following standard notation, let $T_n(x)$, $U_n(x)$ and $V_n(x)$ denote the $n$th Chebyshev
polynomial of the first, second and third kind. Then for all $n \ge 1$, the pairs of
functions $(Z_n, W_n)$ in the following result are i.i.d. $N(0,1)$ distributed.

Proposition 3. Let $X, Y \stackrel{\text{i.i.d.}}{\sim} N(0,1)$. For $n \ge 1$, let


Then $Z_n, W_n \stackrel{\text{i.i.d.}}{\sim} N(0,1)$.

There is nothing special about $X, Y$ being i.i.d. By taking a bivariate normal
vector, orthogonalizing it to a pair of i.i.d. normals, applying Proposition 3 to
the i.i.d. pair, and then finally retransforming to the bivariate normal again, one
similarly finds nonlinear functions of a bivariate normal that have exactly the same
bivariate normal distribution as well. Here is a formal statement.

Corollary 1. Suppose $(X_1, X_2) \sim N(0, 0, 1, 1, \rho)$. Then, for all $n \ge 1$, the pairs of
functions $(Y_{1n}, Y_{2n})$ defined as


are also distributed as $N(0, 0, 1, 1, \rho)$.

The first few members of the polynomials $T_n(x), U_n(x)$ are $T_1(x) = x$, $T_2(x) = 2x^2 - 1$,
$T_3(x) = 4x^3 - 3x$, $T_4(x) = 8x^4 - 8x^2 + 1$, $T_5(x) = 16x^5 - 20x^3 + 5x$,
$T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1$, and $U_0(x) = 1$, $U_1(x) = 2x$, $U_2(x) = 4x^2 - 1$,
$U_3(x) = 8x^3 - 4x$, $U_4(x) = 16x^4 - 12x^2 + 1$, $U_5(x) = 32x^5 - 32x^3 + 6x$; see, e.g., Mason
and Handscomb (2003). Plugging these into the formulae for $Z_n$ and $W_n$ in Proposition 3,
the following illustrative pairs of i.i.d. $N(0,1)$ functions of i.i.d. $N(0,1)$
variables $X, Y$ are obtained.
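The listed polynomials can be regenerated from the three-term recursion $P_{m+1}(x) = 2x P_m(x) - P_{m-1}(x)$, which both kinds satisfy with $T_0 = U_0 = 1$, $T_1 = x$ and $U_1 = 2x$. The helper below is a small illustrative sketch (the function name and the coefficient-list representation are choices made here, not notation from the paper).

```python
def chebyshev(kind, n):
    # Coefficients (lowest degree first) of T_n if kind == "T", else U_n,
    # built from P_{m+1}(x) = 2x P_m(x) - P_{m-1}(x).
    prev = [1]
    cur = [0, 1] if kind == "T" else [0, 2]
    if n == 0:
        return prev
    for _ in range(n - 1):
        nxt = [0] + [2 * c for c in cur]   # multiply P_m by 2x
        for i, c in enumerate(prev):
            nxt[i] -= c                     # subtract P_{m-1}
        prev, cur = cur, nxt
    return cur
```

For example, `chebyshev("T", 4)` returns `[1, 0, -8, 0, 8]`, the coefficients of $8x^4 - 8x^2 + 1$.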



and 



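lin = pY,„ + s/T^sJxl -I- (1 + fF)Xl - 2pXiX2T„
Example 1. Pairs of i.i.d. $N(0,1)$ distributed functions when $X, Y \stackrel{\text{i.i.d.}}{\sim} N(0,1)$:

$$\frac{2XY}{\sqrt{X^2 + Y^2}} \quad \text{and} \quad \frac{X^2 - Y^2}{\sqrt{X^2 + Y^2}} \quad \text{(Shepp's example)}$$

$$\frac{(3X^2 - Y^2)\, Y}{X^2 + Y^2} \quad \text{and} \quad \frac{(X^2 - 3Y^2)\, X}{X^2 + Y^2}$$

$$\frac{4XY(X^2 - Y^2)}{(X^2 + Y^2)^{3/2}} \quad \text{and} \quad \frac{X^4 - 6X^2Y^2 + Y^4}{(X^2 + Y^2)^{3/2}}$$

$$\frac{(5X^4 - 10X^2Y^2 + Y^4)\, Y}{(X^2 + Y^2)^2} \quad \text{and} \quad \frac{(5Y^4 - 10X^2Y^2 + X^4)\, X}{(X^2 + Y^2)^2}$$

$$\frac{6X^5Y - 20X^3Y^3 + 6XY^5}{(X^2 + Y^2)^{5/2}} \quad \text{and} \quad \frac{X^6 - 15X^4Y^2 + 15X^2Y^4 - Y^6}{(X^2 + Y^2)^{5/2}}$$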


Remark 1. Since $Z_n(X,Y)$ and $W_n(X,Y)$ are i.i.d. $N(0,1)$ whenever $X, Y$ are
i.i.d. $N(0,1)$, one would get an i.i.d. pair of standard normals by considering the
functions $Z_m(Z_n(X,Y), W_n(X,Y))$ and $W_m(Z_n(X,Y), W_n(X,Y))$. It is interesting
that $Z_m(Z_n(X,Y), W_n(X,Y)) = Z_{mn}(X,Y)$ and $W_m(Z_n(X,Y), W_n(X,Y)) =
W_{mn}(X,Y)$. Thus, iterations of the functions in Proposition 3 produce members of
the same sequence.

Remark 2. Consider the second pair of functions in Example 1. One notices that
but for a sign, the second function is obtained by plugging $Y$ for $X$ and $X$ for $Y$
in the first function. It is of course obvious that because $X, Y$ are i.i.d., by writing
$Y$ for $X$ and $X$ for $Y$, we cannot change the distribution of the function. What
is interesting is that this operation produces a function independent of the first
function. This in fact occurs for all the even numbered pairs, as is formally stated
in the following proposition.

Proposition 4. For every $n \ge 0$, $W_{2n+1}(X,Y) = (-1)^n Z_{2n+1}(Y,X)$, and hence,
for every $n \ge 0$, $Z_{2n+1}(X,Y)$ and $Z_{2n+1}(Y,X)$ are independently distributed.

Progressively more rugged plots are obtained by plotting the functions $Z_n(x,y)$
and $W_n(x,y)$ as $n$ increases; despite the greater ruggedness, the plots also get
visually more appealing. A few of the plots are presented next. The plots labeled
as V correspond to the functions W of Proposition 3.

Analogous to the Chebyshev polynomials of the first and second kind, those of
the third kind also produce standard normal variables. However, this time there is
no independent mate.

Proposition 5. Let $X, Y \stackrel{\text{i.i.d.}}{\sim} N(0,1)$. For $n \ge 1$, let

Then $Q_n \sim N(0,1)$.

The first few polynomials $V_n(x)$ are $V_1(x) = 2x - 1$, $V_2(x) = 4x^2 - 2x - 1$,
$V_3(x) = 8x^3 - 4x^2 - 4x + 1$, $V_4(x) = 16x^4 - 8x^3 - 12x^2 + 4x + 1$. Plugging these into the
formula for $Q_n$, a sequence of increasingly complex standard normal functions of
$X, Y$ are obtained.

For example, iLsing n = 1, if X, 1' are i.i.d. A'(0, 1), then ^ (2A' — 
v/X2 -I- y2)^l + is distributed as A’(0, 1). In comparison to the A’(0, 1) 

functions $Z_2, W_2$ in Section 3.2, this is a more complex function with a $N(0,1)$
distribution.

Plot of Z4(x,y)

Plot of Z5(x,y)

Plot of V10(x,y)

3.3. The case of three

It is interesting to construct explicitly three i.i.d. $N(0,1)$ functions $f(X,Y,Z)$,
$g(X,Y,Z)$, $h(X,Y,Z)$ of three i.i.d. $N(0,1)$ variables $X, Y, Z$. In this section, we
present a method to explicitly construct such triplets of functions $f(X,Y,Z)$,
$g(X,Y,Z)$, $h(X,Y,Z)$ by using Chebyshev polynomials, as in the case with two
of them. The functions $f, g, h$ we construct are described below.

Proposition 6. Let $X, Y, Z \stackrel{\text{i.i.d.}}{\sim} N(0,1)$. If $U(X,Y), V(X,Y)$ are i.i.d. $N(0,1)$,
then $f(X,Y,Z)$, $g(X,Y,Z)$, $h(X,Y,Z)$ defined as

$f(X,Y,Z) = U(V(X,Y), V(U(X,Y), Z))$,
$g(X,Y,Z) = V(V(X,Y), V(U(X,Y), Z))$,
$h(X,Y,Z) = U(U(X,Y), Z)$

are also distributed as i.i.d. $N(0,1)$.

Example 2. For $U(X,Y), V(X,Y)$, we can use the pair of i.i.d. $N(0,1)$ functions
of Proposition 3. This will give a family of i.i.d. $N(0,1)$ functions $f, g, h$ of $X, Y, Z$.
The first two functions $f, g$ of Proposition 6 are too complicated even when we
use $U = Z_2$ and $V = W_2$ of Proposition 3. But the third function $h$ is reasonably
tidy. For example, using $U = Z_n$ and $V = W_n$ with $n = 2$, one gets the following
distributed as $N(0,1)$:

$$h(X,Y,Z) = \frac{4XYZ}{\sqrt{4X^2Y^2 + Z^2(X^2 + Y^2)}}.$$
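The closed form for $h$ can be checked pointwise against the composition $h = U(U(X,Y), Z)$. The sketch below assumes that the function playing the role of $U$ here is the first member of Shepp's pair, $U(a,b) = 2ab/\sqrt{a^2 + b^2}$, which is the choice that reproduces the displayed formula; the function names are hypothetical.

```python
import math

def shepp_u(a, b):
    # Assumed form of U: the first function of Shepp's pair.
    return 2.0 * a * b / math.hypot(a, b)

def h_closed_form(x, y, z):
    # The expression displayed in Example 2.
    return 4.0 * x * y * z / math.sqrt(4.0 * x * x * y * y + z * z * (x * x + y * y))
```

At any point with $(x, y) \ne (0, 0)$, `shepp_u(shepp_u(x, y), z)` and `h_closed_form(x, y, z)` agree up to rounding.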


4. Cauchy distributed functions, Fredholm integral equations and the
stable law of exponent $\frac{1}{2}$

4.1. Cauchy distributed functions of Cauchy distributed variables

It follows from the result in Proposition 3 that if $C$ has a Cauchy$(0,1)$ distribution,
then appropriate sequences of rational functions $C\lambda_n(C)$ also have a Cauchy$(0,1)$
distribution. These results generalize the observations in Pitman and Williams
(1967). This results, by consideration of characteristic functions, in the Cauchy$(0,1)$
density being a solution to a certain Fredholm integral equation of the first kind. This
connection seems to be worth pointing out. First the functions $f_n(C)$ referred to
above are explicitly identified in the next result.

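Proposition 7. Let $C \sim$ Cauchy$(0,1)$. Let $R = \frac{1}{\sqrt{1 + C^2}}$, and for $k \ge 1$,

$$f_k(C) = \frac{1 + 2T_2(R) + 2T_4(R) + \cdots + 2T_{2k-2}(R) + T_{2k}(R)}{T_{2k}(R)}$$

and

$$g_k(C) = \frac{2T_1(R) + 2T_3(R) + \cdots + 2T_{2k-1}(R) + T_{2k+1}(R)}{T_{2k+1}(R)}.$$

Then $Cf_k(C)$ and $Cg_k(C)$ are also $\sim$ Cauchy$(0,1)$.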
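A pointwise check makes the structure of these functions visible: writing $C = \tan\theta$, so that $R = \cos\theta$ for $|\theta| < \pi/2$, the products $C f_k(C)$ and $C g_k(C)$ should reduce to $\tan(2k\theta)$ and $\tan((2k+1)\theta)$. The sketch below assumes $R = 1/\sqrt{1 + C^2}$ and the numerators $1 + 2T_2 + \cdots + 2T_{2k-2} + T_{2k}$ and $2T_1 + \cdots + 2T_{2k-1} + T_{2k+1}$; the function names are hypothetical.

```python
import math

def cheb_T(n, x):
    # Chebyshev polynomial of the first kind via the three-term recursion.
    a, b = 1.0, x
    if n == 0:
        return a
    for _ in range(n - 1):
        a, b = b, 2.0 * x * b - a
    return b

def f_k(k, c):
    R = 1.0 / math.sqrt(1.0 + c * c)
    num = 1.0 + 2.0 * sum(cheb_T(2 * m, R) for m in range(1, k)) + cheb_T(2 * k, R)
    return num / cheb_T(2 * k, R)

def g_k(k, c):
    R = 1.0 / math.sqrt(1.0 + c * c)
    num = 2.0 * sum(cheb_T(2 * m - 1, R) for m in range(1, k + 1)) + cheb_T(2 * k + 1, R)
    return num / cheb_T(2 * k + 1, R)
```

For instance, `0.37 * g_k(1, 0.37)` agrees with `math.tan(3 * math.atan(0.37))` up to rounding, and `f_k(1, c)` agrees with `2 / (1 - c * c)`.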

Example 3. The functions $f_k, g_k$ for small values of $k$ are as follows:

$$f_1(C) = \frac{2}{1 - C^2}, \qquad g_1(C) = \frac{C^2 - 3}{3C^2 - 1},$$

$$f_2(C) = \frac{4 - 4C^2}{C^4 - 6C^2 + 1}, \qquad g_2(C) = \frac{C^4 - 10C^2 + 5}{5C^4 - 10C^2 + 1},$$

$$f_3(C) = \frac{6C^4 - 20C^2 + 6}{1 - 15C^2 + 15C^4 - C^6}, \qquad g_3(C) = \frac{C^6 - 21C^4 + 35C^2 - 7}{7C^6 - 35C^4 + 21C^2 - 1}.$$

Note that $f_k, g_k$ are rational functions of $C$. Proposition 7 thus gives an infinite
collection of rational functions, say $\lambda_n(C)$, such that $C\lambda_n(C) \sim C(0,1)$. This implies
the following result on Fredholm integral equations.

Proposition 8. Consider the Fredholm integral equation
$\int_{-\infty}^{\infty} K(t,y)\, p(y)\, dy = g(t)$, where $K(t,y) = \cos(t y \lambda(y))$ and $g(t) = e^{-|t|}$.
Then for any of the rational functions $\lambda(y) = f_k(y), g_k(y)$ in Proposition 7, the
Cauchy$(0,1)$ density $p(y) = \frac{1}{\pi(1 + y^2)}$ is a
solution of the above Fredholm equation.


4.2. The stable law with exponent $\frac{1}{2}$

Starting with three i.i.d. standard normal variables, one can construct an infinite
collection of functions of them, each having a symmetric stable distribution with
exponent $\frac{1}{2}$. The construction uses, as in the previous sections, the Chebyshev
polynomials. It is described in the final result.

Proposition 9. Let $X, Y, N$ be i.i.d. $N(0,1)$. Then, for each $n \ge 1$,
$S_{1,n} = \frac{N}{Z_n(X,Y)\, W_n^2(X,Y)}$ and $S_{2,n} = \frac{N}{W_n(X,Y)\, Z_n^2(X,Y)}$ have a symmetric stable
distribution with exponent $\frac{1}{2}$.

Example 4. Using $n = 2, 3$, the following are distributed as a symmetric stable
law of exponent $\frac{1}{2}$:

$$N\, \frac{(X^2 + Y^2)^{3/2}}{4X^2Y^2(X^2 - Y^2)} \quad \text{and} \quad N\, \frac{(X^2 + Y^2)^{3/2}}{2XY(X^2 - Y^2)^2},$$

and

$$N\, \frac{(X^2 + Y^2)^3}{XY^2(X^2 - 3Y^2)(3X^2 - Y^2)^2} \quad \text{and} \quad N\, \frac{(X^2 + Y^2)^3}{X^2Y(3X^2 - Y^2)(X^2 - 3Y^2)^2}.$$

5. Appendix 

Proof of Proposition 3. Proposition 3 is a restatement of the well known fact that
if $X, Y$ are i.i.d. $N(0,1)$, and if $r, \theta$ denote their polar coordinates, then for all
$n \ge 1$, $r \cos n\theta$ and $r \sin n\theta$ are i.i.d. $N(0,1)$, and that the Chebyshev polynomials
$T_n(x), U_n(x)$ are defined by $T_n(x) = \cos n\theta$, $U_n(x) = \frac{\sin (n+1)\theta}{\sin \theta}$, $x = \cos\theta$.
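The polar description is straightforward to evaluate: $(r\cos n\theta, r\sin n\theta)$ are the real and imaginary parts of $(x + iy)^n / r^{n-1}$. The sketch below uses that representation (the helper name is hypothetical, and which component the paper labels $Z_n$ and which $W_n$ is deliberately left open here). It can be used to confirm that $n = 2$ reproduces Shepp's pair and that composing the maps multiplies the indices, as noted in Remark 1.

```python
import math

def polar_pair(n, x, y):
    # Returns (r cos(n*theta), r sin(n*theta)) for (x, y) != (0, 0),
    # computed as the real and imaginary parts of (x + iy)^n / r^(n-1).
    r = math.hypot(x, y)
    z = complex(x, y) ** n / r ** (n - 1)
    return z.real, z.imag
```

For example, `polar_pair(3, *polar_pair(2, x, y))` agrees with `polar_pair(6, x, y)` up to rounding.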

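Proof of Proposition 4. We need to prove that for all $x, y$,

$$W_{2n+1}(x,y) = (-1)^n Z_{2n+1}(y,x) \iff \forall w, \ \sqrt{1 - w^2}\, U_{2n}(w)
= (-1)^n T_{2n+1}\big(\sqrt{1 - w^2}\big).$$

Note now that

$$\frac{d}{dw} (-1)^n T_{2n+1}\big(\sqrt{1 - w^2}\big) = (-1)^{n+1} (2n+1)\, \frac{w}{\sqrt{1 - w^2}}\, U_{2n}\big(\sqrt{1 - w^2}\big),$$

by using the identity $\frac{d}{dw} T_k(w) = k U_{k-1}(w)$.
On the other hand,

$$\frac{d}{dw} \sqrt{1 - w^2}\, U_{2n}(w) = -\frac{w\, U_{2n}(w)}{\sqrt{1 - w^2}} + \frac{(n+1) U_{2n-1}(w) - n U_{2n+1}(w)}{\sqrt{1 - w^2}},$$

by using the identity

$$\frac{d}{dw} U_k(w) = \frac{(k+2) U_{k-1}(w) - k U_{k+1}(w)}{2(1 - w^2)};$$

see Mason and Handscomb (2003) for these derivative identities.

It is enough to show that the derivatives coincide. On some algebra, it is seen
that the derivatives coincide iff $U_{2n-1}(w) - w U_{2n}(w) = (-1)^{n+1} w\, U_{2n}\big(\sqrt{1 - w^2}\big)$,
which follows by induction and the three term recursion for the sequence $U_n$.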
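The reduced identity $\sqrt{1 - w^2}\, U_{2n}(w) = (-1)^n T_{2n+1}(\sqrt{1 - w^2})$ can be spot-checked numerically from the trigonometric definitions $T_n(\cos\theta) = \cos n\theta$ and $U_n(\cos\theta) = \sin((n+1)\theta)/\sin\theta$. This is only an illustrative check on $-1 < w < 1$, with hypothetical function names, not a substitute for the induction argument.

```python
import math

def T(n, x):
    # First-kind Chebyshev polynomial on [-1, 1] via T_n(cos t) = cos(n t).
    return math.cos(n * math.acos(x))

def U(n, x):
    # Second-kind Chebyshev polynomial on (-1, 1) via sin((n+1) t) / sin(t).
    t = math.acos(x)
    return math.sin((n + 1) * t) / math.sin(t)

def prop4_gap(n, w):
    # Difference between the two sides of the reduced identity.
    s = math.sqrt(1.0 - w * w)
    return s * U(2 * n, w) - (-1) ** n * T(2 * n + 1, s)
```

Values such as `prop4_gap(3, 0.3)` should vanish up to floating-point error.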

Proof of Proposition 5. Proposition 5, on some algebra, is a restatement of the
definition of the Chebyshev polynomials of the third kind as
$V_n(x) = \frac{\cos\left(n + \frac{1}{2}\right)\theta}{\cos \frac{\theta}{2}}$, $x = \cos\theta$. We
omit the algebra.

Proof of Proposition 6. If $X, Y, Z$ are i.i.d. $N(0,1)$, and $U(X,Y), V(X,Y)$ are
also i.i.d. $N(0,1)$, then, obviously, $U(X,Y), V(X,Y), Z$ are i.i.d. $N(0,1)$. At the
next step, use this fact with $X, Y, Z$ replaced respectively by $U(X,Y), Z, V(X,Y)$.
This results in $U(U(X,Y), Z), V(U(X,Y), Z), V(X,Y)$ being i.i.d. $N(0,1)$. Then
as a final step, use this fact one more time with $X, Y, Z$ replaced respectively by
$V(X,Y), V(U(X,Y), Z), U(U(X,Y), Z)$. This completes the proof.

Proof of Proposition 7. From Proposition 3, the ratio of the two functions defined
there is distributed as Cauchy$(0,1)$ for all $n \ge 1$.

Thus, we need to reduce this ratio to $Cf_k(C)$ when $n = 2k$ and to $Cg_k(C)$
when $n = 2k + 1$, with $C$ standing for the Cauchy-distributed variable $Y/X$.

The reduction for the two cases $n = 2k$ and $n = 2k + 1$ follows, again on some
algebra, on using the following three identities:

(i) $w U_{n-1}(w) = U_n(w) - T_n(w)$;

(ii) $U_{2k}(w) = T_0(w) + 2T_2(w) + \cdots + 2T_{2k}(w)$;

(iii) $U_{2k+1}(w) = 2T_1(w) + 2T_3(w) + \cdots + 2T_{2k+1}(w)$;

see Mason and Handscomb (2003) for the identities (i)-(iii). Again, we omit the
algebra.

Proof of Proposition 8. Proposition 8 follows from Proposition 7 on using the facts
that each of $f_k, g_k$ is an even function of $C$, and hence the characteristic function of
$Cf_k(C)$ and $Cg_k(C)$ is the same as its Fourier cosine transform, and on using also
the fact that the characteristic function of a Cauchy$(0,1)$ distributed variable is
$e^{-|t|}$.

Proof of Proposition 9. Proposition 9 follows from Proposition 3 and the well known
fact that for three i.i.d. standard normal variables $X, Y, N$, $\frac{N}{XY^2}$ is symmetric stable
with exponent $\frac{1}{2}$; see, e.g., Kendall, Stuart and Ord (1987).

Zeroes of infinitely differentiable
characteristic functions

Herman Rubin¹ and Thomas M. Sellke¹

Purdue University

Abstract: We characterize the sets where an n-dimensional, infinitely differentiable
characteristic function can have its real part zero, positive, and negative,
and where it can have its imaginary part zero, positive, and negative.


1. Introduction and summary

Let $f : R^n \to C$ be the characteristic function of a probability distribution on $R^n$.
Let $A^+ \subset R^n$ be the set on which $\mathrm{Re}\{f(\cdot)\}$ is strictly positive, and let $A^-$ be the
set on which $\mathrm{Re}\{f(\cdot)\}$ is strictly negative. Let $B^+$ be the set on which $\mathrm{Im}\{f(\cdot)\}$
is strictly positive. What can we say about the sets $A^+$, $A^-$, and $B^+$? Since $f$ is
continuous, $A^+$, $A^-$, and $B^+$ are open sets. Since $f(t) = \overline{f(-t)}$ for all $t \in R^n$, we
have $A^+ = -A^+$, $A^- = -A^-$, and $B^+ \cap (-B^+) = \emptyset$. Clearly, $A^+ \cap A^- = \emptyset$. Finally,
it follows from $f(0) = 1$ that $0 \in A^+$ and $0 \notin B^+$.

This paper will show that these obviously necessary conditions on the triple
$(A^+, A^-, B^+)$ are also sufficient to insure the existence of an n-dimensional
characteristic function whose real part is positive precisely on $A^+$ and negative precisely
on $A^-$, and whose imaginary part is positive precisely on $B^+$. Furthermore, this
characteristic function may be taken to be infinitely differentiable.

Let $A^0 \subset R^n$ be a closed set satisfying $0 \notin A^0$ and $A^0 = -A^0$. Let $B^0 \subset R^n$
be a closed set containing 0 whose complement can be expressed as
$(B^0)^c = B^+ \cup (-B^+)$, where $B^+$ is an open set satisfying $B^+ \cap (-B^+) = \emptyset$. It
follows immediately from the main result that there exists an n-dimensional
characteristic function whose real part is zero precisely on $A^0$ and whose imaginary
part is zero precisely on $B^0$. These sufficient conditions on $A^0$ and $B^0$ are obviously
necessary.

Examples of one-dimensional characteristic functions with compact support are
well known. However, the usual examples, and all those obtainable from the famous
sufficient condition of Polya (see Theorem 6.5.3 of Chung (1974)) are not
differentiable at zero, and the authors are not aware of any previously published examples
of infinitely differentiable characteristic functions with compact support.

2. Construction of the characteristic functions $g_{1,n}$ and $g_{2,n}$
For $x \in R$, $x \ne 0$, define

$$r(x) = \frac{6(x - \sin x)}{x^3}.$$
Let r(0) = 1, so that r is continuous. 
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Lemma 1. The characteristic function of the probability density (3/2){(l — 
is r. 

Proof. Direct calculation. □ 


Lemma 2. The function r is unimodal and positive. 


Proof. Since $r$ is symmetric and since $r(0) = 1$ and $\lim_{x \to \infty} r(x) = 0$, it will suffice
to prove that the first derivative $r'(\cdot)$ has no zeroes for $x \in (0, \infty)$. But

$$r'(x) = -6x^{-4}\,[(2 + \cos x)x - 3\sin x],$$

so that it will suffice to prove that $w(\cdot)$ defined by

$$w(x) = (2 + \cos x)x - 3\sin x$$

has no zeroes on $(0, \infty)$. It is easy to see that $w(x)$ is positive for $x \geq \pi$. To take
care of $x \in (0, \pi)$, note that

$$w'(x) = 2 - 2\cos x - x\sin x,$$
$$w''(x) = \sin x - x\cos x,$$
$$w'''(x) = x\sin x.$$

The third derivative $w'''(x)$ is positive for $x \in (0, \pi)$. Since $w''(0) = w'(0) =
w(0) = 0$, it follows that $w(x)$ is positive for $x \in (0, \pi)$, and we are done. □
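As an informal numerical check of Lemmas 1 and 2, the sketch below assumes the closed form $r(x) = 6(x - \sin x)/x^3$, which is the function whose derivative agrees with the expression for $r'$ used in the proof. The helper names are illustrative and do not come from the paper. The code compares $r$ with a midpoint-rule evaluation of the characteristic function of the density $(3/2)\{(1 - |x|)^+\}^2$ and checks positivity and monotonicity on a finite grid only.

```python
import math

def r(x):
    """Assumed closed form r(x) = 6 (x - sin x) / x^3, with r(0) = 1."""
    if abs(x) < 1e-2:
        # Taylor expansion avoids cancellation near 0: 1 - x^2/20 + x^4/840
        return 1.0 - x * x / 20.0 + x ** 4 / 840.0
    return 6.0 * (x - math.sin(x)) / x ** 3

def cf_numeric(t, n=20000):
    """Midpoint rule for 3 * int_0^1 (1 - u)^2 cos(t u) du, i.e. the
    characteristic function of the symmetric density (3/2)(1 - |x|)^2 on [-1, 1]."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * h
        total += (1.0 - u) ** 2 * math.cos(t * u)
    return 3.0 * h * total

# Agreement between the closed form and the numerical characteristic function.
max_err = max(abs(r(t) - cf_numeric(t)) for t in (0.5, 2.0, 7.0, 25.0))

# Grid check of Lemma 2: r is positive and strictly decreasing on (0, infinity).
grid = [0.25 * k for k in range(1, 200)]
is_positive = all(r(x) > 0.0 for x in grid)
is_decreasing = all(r(a) > r(b) for a, b in zip(grid, grid[1:]))
```

A grid check of this kind is of course no substitute for the proof; it only guards against a misreading of the formula.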

Let $X_1, X_2, \ldots$ be i.i.d. random variables with density $(3/2)\{(1 - |x|)^+\}^2$. Define

$$S_1 = \sum_{k=1}^{\infty} X_k/k^2 \quad\text{and}\quad S_2 = \sum_{k=1}^{\infty} X_k/k^4.$$

Let $h_1$ be the density of $S_1$, and let $h_2$ be the density of $S_2$. Since $\sum_{k=1}^{\infty} k^{-2} = \pi^2/6$,
the density $h_1$ is positive precisely on the interval $(-\pi^2/6, \pi^2/6)$. Likewise, since
$\sum_{k=1}^{\infty} k^{-4} = \pi^4/90$, $h_2$ is positive precisely on $(-\pi^4/90, \pi^4/90)$.

It follows from Lemma 1 that the characteristic functions of $S_1$ and $S_2$ are given
by

$$q_1(x) = \prod_{k=1}^{\infty} r(x/k^2) \quad\text{and}\quad q_2(x) = \prod_{k=1}^{\infty} r(x/k^4),$$

respectively.

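By the Fourier inversion theorem (see the corollary on p. 155 of Chung (1974)),

$$h_j(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx} q_j(x)\,dx$$

for $j = 1, 2$. Setting $t = 0$ yields

$$h_j(0) = \frac{1}{2\pi}\int_{-\infty}^{\infty} q_j(x)\,dx.$$

Thus, $\tilde p_j(\cdot)$ defined by

$$\tilde p_j(x) = \frac{q_j(x)}{2\pi h_j(0)}$$

is a probability density with characteristic function given by

$$\tilde g_j(t) = \frac{h_j(t)}{h_j(0)}.$$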
Obviously, $\tilde g_1$ and $\tilde g_2$ are positive precisely on $(-\pi^2/6, \pi^2/6)$ and $(-\pi^4/90,
\pi^4/90)$, respectively. Since $r(\cdot)$ is symmetric about 0 and unimodal, $\tilde p_1$ and $\tilde p_2$
are also symmetric and unimodal. From the definitions of $r(\cdot)$ and $q_j(\cdot)$ above, it is
easy to see that

$$\lim_{x \to \infty} x^m \tilde p_j(x) = 0$$

for $j = 1, 2$ and for all $m > 0$. Thus, the densities $\tilde p_1$ and $\tilde p_2$ have all moments. It
follows that $\tilde g_1$ and $\tilde g_2$ are $C^\infty$. (See Theorem 6.4.1 of Chung (1974).) Finally, we
need to show that the tails of $\tilde p_2$ are fatter than those of $\tilde p_1$ in the sense that, for
each real $a > 0$,

$$\lim_{x \to \infty} \frac{\tilde p_1(ax)}{\tilde p_2(x)} = 0. \qquad (2.1)$$

To do this, it will suffice to show that

$$\lim_{x \to \infty} \frac{q_1(ax)}{q_2(x)} = 0. \qquad (2.2)$$

If $b, c > 0$, then obviously $r(bx)/r(cx) \to (c/b)^2$ as $x \to \infty$. Also, if $b > c > 0$, then
$0 < r(bx)/r(cx) < 1$ for all $x \neq 0$, by Lemma 2. But

$$\frac{q_1(ax)}{q_2(x)} = \prod_{k=1}^{\infty} \frac{r(ax/k^2)}{r(x/k^4)},$$

and the kth factor converges to $(a^2k^4)^{-1}$. There are only finitely many k's for which
$(a^2k^4)^{-1} \geq 1$. If $(a^2k^4)^{-1} < 1$, then $0 < r(ax/k^2)/r(x/k^4) < 1$, and the limiting
value $(a^2k^4)^{-1}$ can be made arbitrarily small by choosing k sufficiently large. This
suffices to prove (2.2) and hence (2.1).

Define $g_1$, $g_2$, $p_1$, and $p_2$ by rescaling $\tilde g_1$, $\tilde g_2$, $\tilde p_1$, and $\tilde p_2$ as follows.

$$g_1(t) = \tilde g_1(\pi^2 t/6), \qquad g_2(t) = \tilde g_2(\pi^4 t/90),$$

$$p_1(x) = (6/\pi^2)\,\tilde p_1(6x/\pi^2), \qquad p_2(x) = (90/\pi^4)\,\tilde p_2(90x/\pi^4).$$

Our results for $\tilde g_1$, $\tilde g_2$, $\tilde p_1$, and $\tilde p_2$ imply the results for $g_1$, $g_2$, $p_1$, and $p_2$ given in
the following lemma.

Lemma 3. The functions $g_1$ and $g_2$ defined above are real-valued, nonnegative
characteristic functions which are positive precisely on $(-1, 1)$. The corresponding
probability densities $p_1$ and $p_2$ are unimodal, and the tails of $p_2$ are fatter than those
of $p_1$ in the sense that, for every $a > 0$, $\lim_{x \to \infty} p_1(ax)/p_2(x) = 0$.

In order to prove our main theorem, we will need an n-dimensional analog of
Lemma 3. For the remainder of this paper, $t$ and $x$ will denote points in $\mathbb{R}^n$ with
respective coordinates $t_i$ and $x_i$, $i = 1, \ldots, n$.

For $j = 1$ and 2, let $Y_j$ be a random vector in $\mathbb{R}^n$ whose coordinates are i.i.d.
random variables with density $p_j$. Then $Y_j$ has density

$$\tilde p_{j,n}(x) = \prod_{i=1}^{n} p_j(x_i)$$

and characteristic function

$$\tilde g_{j,n}(t) = \prod_{i=1}^{n} g_j(t_i).$$

Let $M$ be a random $n \times n$ orthogonal matrix (with the normalized Haar measure
on the group of $n \times n$ orthogonal matrices as its probability distribution), and
suppose $M$ is independent of $Y_j$. Then $Z_j = MY_j$ is a spherically symmetric
random vector in $\mathbb{R}^n$ with density

$$\bar p_{j,n}(x) = \int_{S^{n-1}} \tilde p_{j,n}(\|x\|u)\,d\nu(u),$$

where $S^{n-1} = \{t \in \mathbb{R}^n : \|t\| = 1\}$ is the unit sphere in $\mathbb{R}^n$, and $\nu$ is the rotation
invariant probability measure on $S^{n-1}$. The characteristic function of $Z_j$ is

$$\bar g_{j,n}(t) = \int_{S^{n-1}} \tilde g_{j,n}(\|t\|u)\,d\nu(u),$$

which is $C^\infty$ and is positive precisely on $\{t \in \mathbb{R}^n : \|t\| < \sqrt{n}\}$. For $j = 1$ and 2, let

$$g_{j,n}(t) = \bar g_{j,n}(\sqrt{n}\,t) \qquad (2.3)$$

and

$$p_{j,n}(x) = n^{-n/2}\,\bar p_{j,n}(n^{-1/2}x). \qquad (2.4)$$

The following lemma gives us the results we need to prove the main theorem.

Lemma 4. The functions $g_{1,n}$ and $g_{2,n}$ defined above are real-valued, nonnegative
characteristic functions which are positive precisely on $\{t \in \mathbb{R}^n : \|t\| < 1\}$. For
each $a > 0$, there is a constant $L(a)$ such that the corresponding density functions
$p_{1,n}$ and $p_{2,n}$ satisfy

$$p_{1,n}(ax) \leq L(a)\,p_{2,n}(x)$$

for all $x \in \mathbb{R}^n$.

Proof. Only the second assertion remains to be proved. Fix $a > 0$. It follows from
Lemma 3 that there exists a number $K(a) > 0$ such that $p_1(ax_1) \leq K(a)p_2(x_1)$ for
all $x_1 \in \mathbb{R}$. Thus

$$\tilde p_{1,n}(ax) = \prod_{i=1}^{n} p_1(ax_i) \leq K^n(a) \prod_{i=1}^{n} p_2(x_i) = K^n(a)\,\tilde p_{2,n}(x).$$

Furthermore,

$$\bar p_{1,n}(ax) = \int_{S^{n-1}} \tilde p_{1,n}(a\|x\|u)\,d\nu(u) \leq K^n(a)\int_{S^{n-1}} \tilde p_{2,n}(\|x\|u)\,d\nu(u) = K^n(a)\,\bar p_{2,n}(x).$$

Let $L(a) = K^n(a)$. Then it follows from (2.4) that $p_{1,n}(ax) \leq L(a)\,p_{2,n}(x)$ for all
$x \in \mathbb{R}^n$. □


Remark. It is not hard to show that the spherically symmetric densities $p_{1,n}$ and
$p_{2,n}$ are unimodal, and that, for each $a > 0$, they satisfy

$$\lim_{\|x\| \to \infty} \frac{p_{1,n}(ax)}{p_{2,n}(x)} = 0.$$

We will only need the facts given in Lemma 4, however.



H. Rubin and T. M. Sellke


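3. The main theorem

Theorem. Let $A^+$, $A^-$, and $B^+$ be open subsets of $\mathbb{R}^n$ satisfying $A^+ = -A^+$, $A^- =
-A^-$, $B^+ \cap (-B^+) = \emptyset$, $A^+ \cap A^- = \emptyset$, $0 \in A^+$, and $0 \notin B^+$. Then there exists an
infinitely differentiable characteristic function $f$ on $\mathbb{R}^n$ satisfying

$$A^+ = \{t \in \mathbb{R}^n : \mathrm{Re}(f(t)) > 0\},$$
$$A^- = \{t \in \mathbb{R}^n : \mathrm{Re}(f(t)) < 0\},$$
$$B^+ = \{t \in \mathbb{R}^n : \mathrm{Im}(f(t)) > 0\}.$$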

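Proof. For $c \in \mathbb{R}^n$ and $r$ a positive constant, let

$$B_r(c) = \{t \in \mathbb{R}^n : \|t - c\| < r\}$$

be the open ball in $\mathbb{R}^n$ with center $c$ and radius $r$. We may assume without loss of
generality that $B_1(0) \subset A^+$. Define

$$\tilde A^+ = A^+ \cap \{t \in \mathbb{R}^n : \|t\| > 1/2\}.$$

Since $\tilde A^+$ is open, it is the union of a countable set $\{B_{r_i}(c_i)\}_{i=1}^{\infty}$ of open balls.
Since $\tilde A^+ = -\tilde A^+$, we have $B_{r_i}(-c_i) \subset \tilde A^+$ for all $i$. Define

$$f_i^+(t) = g_{1,n}\{(t - c_i)/r_i\} + g_{1,n}\{(t + c_i)/r_i\}.$$

By Lemma 4, $f_i^+$ is positive precisely on $B_{r_i}(c_i) \cup B_{r_i}(-c_i)$. Taking a Fourier
transform yields

$$(2\pi)^{-n}\int_{\mathbb{R}^n} e^{-i(x \cdot t)} f_i^+(t)\,dt = r_i^n\,(e^{-i(x \cdot c_i)} + e^{i(x \cdot c_i)})\,p_{1,n}(r_i x) = 2r_i^n \cos(x \cdot c_i)\,p_{1,n}(r_i x)$$

(see Theorem 7.7(c) of Rudin (1973)).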

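Let $\{a_i\}_{i=1}^{\infty}$ be a sequence of positive constants satisfying $a_i < 2^{-i-2}
\times \{2r_i^n L(r_i)\}^{-1}$. Then

$$\left| (2\pi)^{-n}\int_{\mathbb{R}^n} e^{-i(x \cdot t)} \sum_{i=1}^{\infty} a_i f_i^+(t)\,dt \right| \leq \sum_{i=1}^{\infty} 2^{-i-2}\{L(r_i)\}^{-1} p_{1,n}(r_i x) \leq \frac{1}{4}\,p_{2,n}(x).$$

Furthermore, by choosing the $a_i$'s to converge to zero sufficiently fast, we can insure
that $f^+(\cdot)$ defined by

$$f^+(t) = \sum_{i=1}^{\infty} a_i f_i^+(t)$$

is $C^\infty$ and in $L^1(\mathbb{R}^n)$. Note that the real-valued, nonnegative function $f^+(\cdot)$ is
nonzero precisely on $\tilde A^+$.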

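Let $\{B_{r_i'}(c_i')\}_{i=1}^{\infty}$ be a sequence of open balls whose union is $A^-$, and let

$$f_i^-(t) = -g_{1,n}\{(t - c_i')/r_i'\} - g_{1,n}\{(t + c_i')/r_i'\}.$$

The same argument used above shows that we can choose a sequence of positive
constants $\{\beta_i\}_{i=1}^{\infty}$ such that $f^-(\cdot)$ defined by

$$f^-(t) = \sum_{i=1}^{\infty} \beta_i f_i^-(t)$$

is $C^\infty$, in $L^1(\mathbb{R}^n)$, and satisfies

$$\left| (2\pi)^{-n}\int_{\mathbb{R}^n} e^{-i(x \cdot t)} f^-(t)\,dt \right| \leq \frac{1}{4}\,p_{2,n}(x).$$

Note that the real-valued, nonpositive function $f^-(\cdot)$ is nonzero precisely on $A^-$.
Let $\{B_{r_i''}(c_i'')\}_{i=1}^{\infty}$ be a sequence of open balls whose union is $B^+$. Let

$$f_i^{im}(t) = i\,[g_{1,n}\{(t - c_i'')/r_i''\} - g_{1,n}\{(t + c_i'')/r_i''\}].$$

Then

$$(2\pi)^{-n}\int_{\mathbb{R}^n} e^{-i(x \cdot t)} f_i^{im}(t)\,dt = i\,(r_i'')^n\,(e^{-i(x \cdot c_i'')} - e^{i(x \cdot c_i'')})\,p_{1,n}(r_i'' x) = 2(r_i'')^n \sin(x \cdot c_i'')\,p_{1,n}(r_i'' x).$$

Again, we can choose a sequence of positive constants $\{\gamma_i\}_{i=1}^{\infty}$ so that $f^{im}(\cdot)$
defined by

$$f^{im}(t) = \sum_{i=1}^{\infty} \gamma_i f_i^{im}(t)$$

is $C^\infty$, in $L^1(\mathbb{R}^n)$, and satisfies

$$\left| (2\pi)^{-n}\int_{\mathbb{R}^n} e^{-i(x \cdot t)} f^{im}(t)\,dt \right| \leq \frac{1}{4}\,p_{2,n}(x).$$

Note that the function $f^{im}(\cdot)$ is pure imaginary, and that its imaginary part is
positive precisely on $B^+$.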

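Now let

$$f(t) = g_{2,n}(t) + f^+(t) + f^-(t) + f^{im}(t).$$

Clearly the real and imaginary parts of $f$ are positive and negative on the proper
sets. The function $f$ is $C^\infty$, and in $L^1(\mathbb{R}^n)$.

Define

$$p(x) = (2\pi)^{-n}\int_{\mathbb{R}^n} e^{-i(x \cdot t)} f(t)\,dt.$$

Since

$$\left| (2\pi)^{-n}\int_{\mathbb{R}^n} e^{-i(x \cdot t)}\,(f^+(t) + f^-(t) + f^{im}(t))\,dt \right| \leq \frac{3}{4}\,p_{2,n}(x)$$

and

$$(2\pi)^{-n}\int_{\mathbb{R}^n} e^{-i(x \cdot t)} g_{2,n}(t)\,dt = p_{2,n}(x),$$

we have

$$\frac{1}{4}\,p_{2,n}(x) \leq p(x) \leq 2p_{2,n}(x).$$

By the Fourier inversion theorem (again, see Theorem 7.7(c) of Rudin (1973)),

$$f(t) = \int_{\mathbb{R}^n} e^{i(t \cdot x)} p(x)\,dx.$$

Also, since $f(0) = g_{2,n}(0) = 1$, we have

$$\int_{\mathbb{R}^n} p(x)\,dx = f(0) = 1.$$

Thus, $f$ is the characteristic function of the probability density $p$, and $f$ satisfies all
the requirements of the theorem. □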
Addendum

Except for slight corrections, the present paper was completed in 1984. Results
very similar to the one-dimensional version of our main theorem appear in Sasvari
(1985).
type IV distributions

Abstract: Using an identity of Stein (1986), this article gives an exact expression
for the characteristic function of Pearson type IV distributions in terms
of confluent hypergeometric functions.


1. Introduction 


Pearson (1895) introduced a family of probability density functions where each
member $p$ of the family satisfies a differential equation

$$p^{(1)}(w) = -\frac{a + w}{a_2w^2 + a_1w + a_0}\,p(w), \qquad (1)$$

for some constants $a$, $a_0$, $a_1$ and $a_2$. The Pearson family is very general and it
includes many of the probability distributions in common use today. For example,
the beta distribution belongs to the class of Pearson type I distributions, the gamma
distribution to Pearson type III distributions and the t distribution to Pearson type
VII distributions.

This article focuses on the Pearson type IV distributions. These distributions
have unlimited range in both directions and are unimodal. In particular, Pearson
type IV distributions are characterized by members satisfying (1) with $0 < a_2 < 1$
and the equation

$$a_2w^2 + a_1w + a_0 = 0$$

having no real roots. Writing $A_0 = a_0 - a_1^2(4a_2)^{-1}$ and $A_1 = a_1(2a_2)^{-1}$, it follows
from (1) that a Pearson type IV distribution has a probability density function of
the form

$$p(w) = \frac{A}{[A_0 + a_2(w + A_1)^2]^{1/(2a_2)}} \exp\left[ -\frac{a - A_1}{\sqrt{a_2A_0}} \arctan\left( \frac{w + A_1}{\sqrt{A_0/a_2}} \right) \right], \qquad \forall w \in \mathbb{R},$$

where $A$ is the normalizing constant. It is well known that Pearson type IV
distributions are technically difficult to handle in practice (Stuart and Ord (1994),
page 222). Johnson, Kotz and Balakrishnan (1994), page 19, noted that working
with $p(w)$ often leads to intractable mathematics, for example if one attempts to
calculate its cumulative distribution function.
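To illustrate how the density follows from (1), the sketch below checks numerically that the logarithmic derivative of the (unnormalized) type IV density equals $-(a + w)/(a_2w^2 + a_1w + a_0)$. It assumes the density has the form $[A_0 + a_2(w + A_1)^2]^{-1/(2a_2)} \exp\{-(a - A_1)(a_2A_0)^{-1/2} \arctan((w + A_1)/\sqrt{A_0/a_2})\}$, obtained by integrating (1) directly; the function names and the parameter values are hypothetical choices made for this illustration only.

```python
import math

def pearson4_log_density(w, a, a0, a1, a2):
    """Log of the unnormalized Pearson type IV density (assumed form, see lead-in)."""
    A0 = a0 - a1 * a1 / (4.0 * a2)   # positive when a2*w^2 + a1*w + a0 has no real roots
    A1 = a1 / (2.0 * a2)
    u = w + A1
    log_power = -math.log(A0 + a2 * u * u) / (2.0 * a2)
    log_exp = -(a - A1) / math.sqrt(a2 * A0) * math.atan(u / math.sqrt(A0 / a2))
    return log_power + log_exp

def pearson_rhs(w, a, a0, a1, a2):
    """Right-hand side of (1) divided by p(w): -(a + w) / (a2 w^2 + a1 w + a0)."""
    return -(a + w) / (a2 * w * w + a1 * w + a0)

def log_derivative(w, params, h=1e-5):
    """Central finite difference of the log density."""
    return (pearson4_log_density(w + h, *params)
            - pearson4_log_density(w - h, *params)) / (2.0 * h)

# Hypothetical parameters with 0 < a2 < 1 and 4*a0*a2 > a1^2.
params = (0.3, 2.0, 0.5, 0.25)
worst = max(abs(log_derivative(w, params) - pearson_rhs(w, *params))
            for w in (-5.0, -1.0, 0.0, 0.7, 3.0, 10.0))
```

The normalizing constant $A$ plays no role in this check because it cancels in the logarithmic derivative.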

The main result of this article is an exact expression (see Theorem 2) for the
characteristic function of a Pearson type IV distribution in terms of confluent
hypergeometric functions. We note that we have been unable to find any non-asymptotic
closed-form expression for the characteristic function of a Pearson type IV
distribution in the literature.

Keywords and phrases: characteristic function, confluent hypergeometric function, Pearson

The approach that we shall take is inspired by the results of Stein (1986) on
the Pearson family of distributions. Since confluent hypergeometric functions have
an extensive literature going back over two hundred years to Euler and Gauss, it is
plausible that Theorem 2 may provide us with a way of understanding the behavior
of Pearson type IV distributions in a more rigorous manner.

For example, one possible use of Theorem 2 is that we can now apply Fourier
analytic techniques in combination with Stein's method [see Stein (1986)] to obtain
Pearson type IV approximations to the distribution of a sum of weakly dependent
random variables. This work is currently in progress and hence will not be addressed
here. The hope is that such a Pearson type IV approximation would have the same
order of accuracy as that of a one-term Edgeworth expansion [see, for example,
Feller (1971), page 539] with the (often desirable) property that the Pearson type
IV approximation is a probability distribution whereas the one-term Edgeworth
expansion is not.

We should also mention that besides one-term Edgeworth approximations,
gamma and chi-square approximations exist in the literature [see, for example,
Shorack (2000), page 383]. The latter approximations typically have the same
order of accuracy as the former. However, gamma and chi-square approximations are
supported on the half real line and may be qualitatively inappropriate for some
applications.

Finally, throughout this article, $I\{\cdot\}$ denotes the indicator function and for any
function $h : \mathbb{R} \to \mathbb{R}$, we write $h^{(r)}$ as the rth derivative of $h$ (if it exists) whenever
$r = 1, 2, \ldots$.


2. Pearson type IV characteristic function

We shall first state an identity of Stein (1986) for Pearson type IV distributions.

Theorem 1 (Stein). Let $p$ be the probability density function of a Pearson type
IV distribution satisfying

$$p^{(1)}(w) = -\frac{(2\alpha_2 + 1)w + \alpha_1}{\alpha_2w^2 + \alpha_1w + \alpha_0}\,p(w), \qquad \forall w \in \mathbb{R}, \qquad (2)$$

for some constants $\alpha_0$, $\alpha_1$ and $\alpha_2$. Then for a given bounded piecewise continuous
function $h : \mathbb{R} \to \mathbb{R}$, the differential equation

$$(\alpha_2w^2 + \alpha_1w + \alpha_0)f^{(1)}(w) - wf(w) = h(w), \qquad \forall w \in \mathbb{R}, \qquad (3)$$

has a bounded continuous and piecewise continuously differentiable solution $f : \mathbb{R} \to
\mathbb{R}$ if and only if

$$\int_{-\infty}^{\infty} h(w)p(w)\,dw = 0. \qquad (4)$$

When (4) is satisfied, the unique bounded solution $f$ of (3) is given by

$$f(w) = \int_{-\infty}^{w} \frac{h(x)}{\alpha_2x^2 + \alpha_1x + \alpha_0} \exp\left( \int_x^w \frac{y\,dy}{\alpha_2y^2 + \alpha_1y + \alpha_0} \right) dx$$

$$\phantom{f(w)} = -\int_{w}^{\infty} \frac{h(x)}{\alpha_2x^2 + \alpha_1x + \alpha_0} \exp\left( -\int_w^x \frac{y\,dy}{\alpha_2y^2 + \alpha_1y + \alpha_0} \right) dx, \qquad \forall w \in \mathbb{R}.$$

We refer the reader to Stein (1986), Chapter 6, for the proof of Theorem 1.

Let $Z$ be a random variable having probability density function $p$ where $p$
satisfies (2).


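Proposition 1. Let $Z$ be as above and $\psi_Z$ be its characteristic function. Then $\psi_Z$
satisfies the following homogeneous second order linear differential equation:

$$\psi_Z^{(1)}(t) + t\alpha_0\psi_Z(t) - t\alpha_2\psi_Z^{(2)}(t) - it\alpha_1\psi_Z^{(1)}(t) = 0, \qquad \forall t \in \mathbb{R}. \qquad (5)$$

Proof. Since $\psi_Z(t) = Ee^{itZ}$, $t \in \mathbb{R}$, we observe from Theorem 1 that

$$\int_{-\infty}^{\infty} \left[ (\alpha_2w^2 + \alpha_1w + \alpha_0)\frac{d}{dw}(e^{itw}) - we^{itw} \right] p(w)\,dw$$

$$= \int_{-\infty}^{\infty} \left[ it(\alpha_2w^2 + \alpha_1w + \alpha_0)e^{itw} - we^{itw} \right] p(w)\,dw$$

$$= (\alpha_2w^2 + \alpha_1w + \alpha_0)e^{itw}p(w)\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} [(2\alpha_2 + 1)w + \alpha_1]\,e^{itw}p(w)\,dw - \int_{-\infty}^{\infty} (\alpha_2w^2 + \alpha_1w + \alpha_0)e^{itw}p^{(1)}(w)\,dw$$

$$= 0.$$

Hence we conclude that

$$-it\alpha_2\psi_Z^{(2)}(t) + t\alpha_1\psi_Z^{(1)}(t) + it\alpha_0\psi_Z(t) + i\psi_Z^{(1)}(t) = 0, \qquad \forall t \in \mathbb{R}.$$

This proves Proposition 1. □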


Definition. Following Slater (1960), pages 2 to 5, we define the confluent
hypergeometric function (with complex-valued parameters $a$ and $b$) to be a power series
in $x$ of the form

$${}_1F_1(a; b; x) = \sum_{j=0}^{\infty} \frac{(a)_j\,x^j}{(b)_j\,j!},$$

where $(a)_j = a(a + 1) \cdots (a + j - 1)$, etc. and $b$ is not a negative integer or 0. We
further define

$$U(a; b; x) = \frac{\Gamma(1 - b)}{\Gamma(1 + a - b)}\,{}_1F_1(a; b; x) + \frac{\Gamma(b - 1)}{\Gamma(a)}\,x^{1-b}\,{}_1F_1(1 + a - b; 2 - b; x).$$

Remark. It is well known [see for example Theorem 2.1.1 of Andrews, Askey and
Roy (1999)] that the series ${}_1F_1(a; b; x)$ [and hence $U(a; b; x)$] converges absolutely
for all $x$.
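The definition above translates almost directly into code. The following is a minimal sketch, assuming the two-term formula $U(a; b; x) = \Gamma(1 - b)\Gamma(1 + a - b)^{-1}\,{}_1F_1(a; b; x) + \Gamma(b - 1)\Gamma(a)^{-1}\,x^{1-b}\,{}_1F_1(1 + a - b; 2 - b; x)$; the function names are illustrative. Because the Python standard library has no complex gamma function, `kummer_u` below is restricted to real parameters and $x > 0$, whereas the parameters arising in Theorem 2 are complex and would need a library with a complex gamma function.

```python
import math

def hyp1f1(a, b, x, tol=1e-15, max_terms=2000):
    """Power series sum_j (a)_j x^j / ((b)_j j!); a, b, x may be complex.
    b must not be 0 or a negative integer."""
    term = 1.0 + 0.0j
    total = term
    for j in range(max_terms):
        # Ratio of consecutive terms: (a + j) x / ((b + j)(j + 1)).
        term *= (a + j) * x / ((b + j) * (j + 1))
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def kummer_u(a, b, x):
    """U(a; b; x) from the two-term formula, for real a and b (b not an integer,
    a and 1 + a - b not zero or negative integers) and real x > 0."""
    first = math.gamma(1.0 - b) / math.gamma(1.0 + a - b) * hyp1f1(a, b, x)
    second = (math.gamma(b - 1.0) / math.gamma(a)
              * x ** (1.0 - b) * hyp1f1(1.0 + a - b, 2.0 - b, x))
    return (first + second).real

# Classical special case: U(1/2; 1/2; z^2) = sqrt(pi) * exp(z^2) * erfc(z).
u_value = kummer_u(0.5, 0.5, 1.0)
u_reference = math.sqrt(math.pi) * math.e * math.erfc(1.0)
```

The power series is adequate for moderate $|x|$; for large arguments the terms grow before they decay and cancellation limits the attainable accuracy.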

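The theorem below establishes an explicit expression for $\psi_Z$.

Theorem 2. Let $\psi_Z$ be as in Proposition 1,

$$r = \frac{\sqrt{4\alpha_0\alpha_2 - \alpha_1^2}}{2\alpha_2} + \frac{i\alpha_1}{2\alpha_2}, \qquad \bar r = \frac{\sqrt{4\alpha_0\alpha_2 - \alpha_1^2}}{2\alpha_2} - \frac{i\alpha_1}{2\alpha_2}, \qquad \nu = 1 + \frac{1}{\alpha_2}, \qquad \Delta = \sqrt{4\alpha_0\alpha_2 - \alpha_1^2}, \qquad (6)$$

and $k\alpha_2 \neq 1$ for all $k = 1, 2, \ldots$. Then for $t \in \mathbb{R}$, we have

$$\psi_Z(t) = \frac{e^{-rt}\,\Gamma(\nu - r\Delta^{-1})}{\Gamma(\nu)}\,U\left( -\frac{r}{\Delta}; -\frac{1}{\alpha_2}; \frac{\Delta t}{\alpha_2} \right) I\{t \geq 0\} + \frac{e^{-\bar r|t|}\,\Gamma(\nu - \bar r\Delta^{-1})}{\Gamma(\nu)}\,U\left( -\frac{\bar r}{\Delta}; -\frac{1}{\alpha_2}; \frac{\Delta|t|}{\alpha_2} \right) I\{t < 0\}.$$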


Remark. We would like to add that the confluent hypergeometric function $U(\cdot; \cdot; \cdot)$
is available in a number of mathematical software packages. For example in
Mathematica [Wolfram (1996)],

HypergeometricU[a, b, x]

is the command to evaluate $U(a; b; x)$.

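Proof of Theorem 2. We observe from (5) that for all $t \in \mathbb{R}$,

$$t\psi^{(2)}(t) + \left( \frac{i\alpha_1t}{\alpha_2} - \frac{1}{\alpha_2} \right)\psi^{(1)}(t) - \frac{\alpha_0t}{\alpha_2}\,\psi(t) = 0. \qquad (7)$$

Step 1. Suppose that $t > 0$. We seek a solution of the above differential equation
that has the form

$$\psi(t) = e^{-rt}\sum_{j=0}^{\infty} c_jt^j, \qquad \forall\, 0 < t < \infty,$$

for complex constants $c_0, c_1, \ldots$. Observing that

$$\psi^{(1)}(t) = -re^{-rt}\sum_{j=0}^{\infty} c_jt^j + e^{-rt}\sum_{j=1}^{\infty} jc_jt^{j-1},$$

$$\psi^{(2)}(t) = r^2e^{-rt}\sum_{j=0}^{\infty} c_jt^j - 2re^{-rt}\sum_{j=1}^{\infty} jc_jt^{j-1} + e^{-rt}\sum_{j=2}^{\infty} j(j - 1)c_jt^{j-2},$$

and substituting these expressions into the left hand side of (7), we have

$$r^2e^{-rt}\sum_{j=0}^{\infty} c_jt^{j+1} - 2re^{-rt}\sum_{j=1}^{\infty} jc_jt^j + e^{-rt}\sum_{j=2}^{\infty} j(j - 1)c_jt^{j-1}$$
$$+ \left( -\frac{1}{\alpha_2} + \frac{i\alpha_1t}{\alpha_2} \right)\left[ -re^{-rt}\sum_{j=0}^{\infty} c_jt^j + e^{-rt}\sum_{j=1}^{\infty} jc_jt^{j-1} \right] - \frac{\alpha_0}{\alpha_2}\,e^{-rt}\sum_{j=0}^{\infty} c_jt^{j+1} = 0, \qquad \forall\, 0 < t < \infty. \qquad (8)$$

Equating the coefficient of $t^0$ in (8) to zero, we have

$$\frac{rc_0}{\alpha_2} - \frac{c_1}{\alpha_2} = 0,$$

and equating the coefficient of $t^j$, $j = 1, 2, \ldots$, in (8) to zero, we have

$$r^2c_{j-1} - 2rjc_j + j(j + 1)c_{j+1} + \frac{rc_j}{\alpha_2} - \frac{(j + 1)c_{j+1}}{\alpha_2} - \frac{i\alpha_1rc_{j-1}}{\alpha_2} + \frac{i\alpha_1jc_j}{\alpha_2} - \frac{\alpha_0c_{j-1}}{\alpha_2} = 0.$$

This implies that $c_1 = rc_0$, and in general for $j = 2, 3, \ldots$,

$$c_j = \frac{1}{j[1 - (j - 1)\alpha_2]}\left\{ c_{j-1}[r - 2r(j - 1)\alpha_2 + i(j - 1)\alpha_1] + c_{j-2}(-\alpha_0 + r^2\alpha_2 - i\alpha_1r) \right\},$$

whenever $k\alpha_2 \neq 1$, $k = 1, 2, \ldots$. We observe from (6) that $r$ satisfies

$$r^2\alpha_2 - i\alpha_1r - \alpha_0 = 0.$$

Since $4\alpha_0\alpha_2 > \alpha_1^2$ (from the definition of Pearson type IV distributions), we
conclude that

$$c_j = c_{j-1}\left\{ \frac{r - 2r(j - 1)\alpha_2 + i(j - 1)\alpha_1}{j[1 - (j - 1)\alpha_2]} \right\} = c_{j-2}\prod_{k=j-1}^{j}\left\{ \frac{r - 2r(k - 1)\alpha_2 + i(k - 1)\alpha_1}{k[1 - (k - 1)\alpha_2]} \right\} = c_0\prod_{k=1}^{j}\left\{ \frac{r - 2r(k - 1)\alpha_2 + i(k - 1)\alpha_1}{k[1 - (k - 1)\alpha_2]} \right\}, \qquad \forall j = 1, 2, \ldots,$$

and hence for $t > 0$,

$$\psi(t) = c_0e^{-rt}\sum_{j=0}^{\infty} t^j\prod_{k=1}^{j}\left\{ \frac{r - 2r(k - 1)\alpha_2 + i(k - 1)\alpha_1}{k[1 - (k - 1)\alpha_2]} \right\} = c_0e^{-rt}\sum_{j=0}^{\infty} \frac{t^j}{j!}\prod_{k=1}^{j}\left\{ \frac{(k - 1)\Delta - r}{(k - 1)\alpha_2 - 1} \right\} = c_0e^{-rt}\,{}_1F_1\left( -\frac{r}{\Delta}; -\frac{1}{\alpha_2}; \frac{\Delta t}{\alpha_2} \right). \qquad (9)$$

Step 2. Suppose that $t < 0$. Writing $\xi = -t$ and $u_Z(\xi) = \psi_Z(t)$, we have

$$u_Z^{(1)}(\xi) = \frac{d\psi_Z(t)}{d\xi} = -\psi_Z^{(1)}(t),$$

and

$$u_Z^{(2)}(\xi) = \frac{d}{d\xi}\left( \frac{d\psi_Z(t)}{dt}\cdot\frac{dt}{d\xi} \right) = \psi_Z^{(2)}(t).$$

Consequently, (5) now takes the form

$$\xi u_Z^{(2)}(\xi) + \left( -\frac{i\alpha_1\xi}{\alpha_2} - \frac{1}{\alpha_2} \right)u_Z^{(1)}(\xi) - \frac{\alpha_0\xi}{\alpha_2}\,u_Z(\xi) = 0, \qquad \forall \xi > 0. \qquad (10)$$

We seek a solution of the above differential equation that has the form

$$u(\xi) = e^{-\bar r\xi}\sum_{j=0}^{\infty} d_j\xi^j, \qquad \forall\, 0 < \xi < \infty,$$

for complex constants $d_0, d_1, \ldots$. Arguing as in Step 1, we observe that for $t =
-\xi < 0$,

$$u(\xi) = d_0e^{-\bar r\xi}\,{}_1F_1\left( -\frac{\bar r}{\Delta}; -\frac{1}{\alpha_2}; \frac{\Delta\xi}{\alpha_2} \right). \qquad (11)$$

Since a solution of (7) is continuous at $t = 0$, we have $c_0 = d_0$. Thus we conclude
from (9) and (11) that a solution of (7) is

$$\psi(t) = c_0e^{-rt}\,{}_1F_1\left( -\frac{r}{\Delta}; -\frac{1}{\alpha_2}; \frac{\Delta t}{\alpha_2} \right) I\{t \geq 0\} + c_0e^{-\bar r|t|}\,{}_1F_1\left( -\frac{\bar r}{\Delta}; -\frac{1}{\alpha_2}; \frac{\Delta|t|}{\alpha_2} \right) I\{t < 0\}. \qquad (12)$$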

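Step 3. Suppose that $t > 0$. We seek a solution of (7) that has the form

$$\psi(t) = e^{-rt}\sum_{j=0}^{\infty} c_jt^{\nu+j}, \qquad \forall\, 0 < t < \infty,$$

for complex constants $c_0, c_1, \ldots$. Observing that

$$\psi^{(1)}(t) = -re^{-rt}\sum_{j=0}^{\infty} c_jt^{\nu+j} + e^{-rt}\sum_{j=0}^{\infty} (\nu + j)c_jt^{\nu+j-1},$$

$$\psi^{(2)}(t) = r^2e^{-rt}\sum_{j=0}^{\infty} c_jt^{\nu+j} - 2re^{-rt}\sum_{j=0}^{\infty} (\nu + j)c_jt^{\nu+j-1} + e^{-rt}\sum_{j=0}^{\infty} (\nu + j)(\nu + j - 1)c_jt^{\nu+j-2},$$

and substituting these expressions into the left hand side of (7), we have

$$r^2e^{-rt}\sum_{j=0}^{\infty} c_jt^{\nu+j+1} - 2re^{-rt}\sum_{j=0}^{\infty} (\nu + j)c_jt^{\nu+j} + e^{-rt}\sum_{j=0}^{\infty} (\nu + j)(\nu + j - 1)c_jt^{\nu+j-1}$$
$$+ \left( -\frac{1}{\alpha_2} + \frac{i\alpha_1t}{\alpha_2} \right)\left[ -re^{-rt}\sum_{j=0}^{\infty} c_jt^{\nu+j} + e^{-rt}\sum_{j=0}^{\infty} (\nu + j)c_jt^{\nu+j-1} \right] - \frac{\alpha_0}{\alpha_2}\,e^{-rt}\sum_{j=0}^{\infty} c_jt^{\nu+j+1} = 0, \qquad \forall\, 0 < t < \infty. \qquad (13)$$

Equating the coefficient of $t^{\nu}$ in (13) to zero, we have

$$-2r\nu c_0 + (\nu + 1)\nu c_1 + \frac{rc_0}{\alpha_2} - \frac{(\nu + 1)c_1}{\alpha_2} + \frac{i\alpha_1\nu c_0}{\alpha_2} = 0,$$

and equating the coefficient of $t^{\nu+j-1}$, $j = 2, 3, \ldots$, in (13) to zero, we have

$$r^2c_{j-2} - 2r(\nu + j - 1)c_{j-1} + (\nu + j)(\nu + j - 1)c_j + \frac{rc_{j-1}}{\alpha_2} - \frac{(\nu + j)c_j}{\alpha_2} + \frac{i\alpha_1}{\alpha_2}\left[ -rc_{j-2} + (\nu + j - 1)c_{j-1} \right] - \frac{\alpha_0c_{j-2}}{\alpha_2} = 0.$$

This gives

$$c_1 = c_0\,\frac{2r\nu\alpha_2 - r - i\alpha_1\nu}{\alpha_2(1 + \nu)},$$

and in general for $j = 2, 3, \ldots$,

$$c_j = \frac{1}{j(\nu + j)\alpha_2}\left\{ [2(\nu + j - 1)r\alpha_2 - r - i\alpha_1(\nu + j - 1)]c_{j-1} - (\alpha_2r^2 - i\alpha_1r - \alpha_0)c_{j-2} \right\} = c_0\prod_{k=1}^{j} \frac{2(\nu + k - 1)r\alpha_2 - r - i\alpha_1(\nu + k - 1)}{k(\nu + k)\alpha_2}.$$

Hence for $t > 0$, we have

$$\psi(t) = c_0t^{\nu}e^{-rt}\sum_{j=0}^{\infty} t^j\prod_{k=1}^{j} \frac{2(\nu + k - 1)r\alpha_2 - r - i\alpha_1(\nu + k - 1)}{k(\nu + k)\alpha_2} = c_0t^{\nu}e^{-rt}\,{}_1F_1\left( \nu - \frac{r}{\Delta}; \nu + 1; \frac{\Delta t}{\alpha_2} \right). \qquad (14)$$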

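Step 4. Suppose that $t < 0$. Writing $\xi = -t$ and $u_Z(\xi) = \psi_Z(t)$, we seek a
solution of (10) that has the form

$$u(\xi) = e^{-\bar r\xi}\sum_{j=0}^{\infty} d_j\xi^{\nu+j}, \qquad \forall\, 0 < \xi < \infty,$$

for complex constants $d_0, d_1, \ldots$. Arguing as in Step 3, we observe that for $t =
-\xi < 0$,

$$u(\xi) = d_0\xi^{\nu}e^{-\bar r\xi}\,{}_1F_1\left( \nu - \frac{\bar r}{\Delta}; \nu + 1; \frac{\Delta\xi}{\alpha_2} \right). \qquad (15)$$

Since a solution of (7) is continuous at $t = 0$, we have $c_0 = d_0$. Thus we conclude
from (14) and (15) that a solution of (7) is

$$\psi(t) = c_0t^{\nu}e^{-rt}\,{}_1F_1\left( \nu - \frac{r}{\Delta}; \nu + 1; \frac{\Delta t}{\alpha_2} \right) I\{t \geq 0\} + c_0|t|^{\nu}e^{-\bar r|t|}\,{}_1F_1\left( \nu - \frac{\bar r}{\Delta}; \nu + 1; \frac{\Delta|t|}{\alpha_2} \right) I\{t < 0\}. \qquad (16)$$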

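As the solutions in (12) and (16) are independent, the general solution of (7) is
given by

$$\psi(t) = \left\{ Ae^{-rt}\,{}_1F_1\left( -\frac{r}{\Delta}; -\frac{1}{\alpha_2}; \frac{\Delta t}{\alpha_2} \right) + B|t|^{\nu}e^{-r|t|}\,{}_1F_1\left( \nu - \frac{r}{\Delta}; \nu + 1; \frac{\Delta t}{\alpha_2} \right) \right\} I\{t \geq 0\}$$
$$+ \left\{ \tilde Ae^{-\bar r|t|}\,{}_1F_1\left( -\frac{\bar r}{\Delta}; -\frac{1}{\alpha_2}; \frac{\Delta|t|}{\alpha_2} \right) + \tilde B|t|^{\nu}e^{-\bar r|t|}\,{}_1F_1\left( \nu - \frac{\bar r}{\Delta}; \nu + 1; \frac{\Delta|t|}{\alpha_2} \right) \right\} I\{t < 0\},$$

where $A$, $\tilde A$, $B$ and $\tilde B$ are arbitrary constants. Consequently, since $\psi_Z(0) = 1$, we
have $A = \tilde A = 1$ and

$$\psi_Z(t) = \left\{ e^{-rt}\,{}_1F_1\left( -\frac{r}{\Delta}; -\frac{1}{\alpha_2}; \frac{\Delta t}{\alpha_2} \right) + B|t|^{\nu}e^{-r|t|}\,{}_1F_1\left( \nu - \frac{r}{\Delta}; \nu + 1; \frac{\Delta t}{\alpha_2} \right) \right\} I\{t \geq 0\}$$
$$+ \left\{ e^{-\bar r|t|}\,{}_1F_1\left( -\frac{\bar r}{\Delta}; -\frac{1}{\alpha_2}; \frac{\Delta|t|}{\alpha_2} \right) + \tilde B|t|^{\nu}e^{-\bar r|t|}\,{}_1F_1\left( \nu - \frac{\bar r}{\Delta}; \nu + 1; \frac{\Delta|t|}{\alpha_2} \right) \right\} I\{t < 0\}, \qquad (17)$$

for some constants $B$ and $\tilde B$.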

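Step 5. To complete the proof of Theorem 2, it suffices to determine the
constants $B$ and $\tilde B$ in (17). We observe from Slater (1960), page 60, that for $x \to \infty$,

$${}_1F_1(a; b; x) = \frac{\Gamma(b)}{\Gamma(a)}\,x^{a-b}e^{x}\,(1 + O(|x|^{-1})).$$

Hence it follows from (17) that as $t \to \infty$,

$$\psi_Z(t) = e^{-rt + \Delta t/\alpha_2}\left( \frac{\Delta t}{\alpha_2} \right)^{\alpha_2^{-1} - r\Delta^{-1}}\left\{ \frac{\Gamma(-\alpha_2^{-1})}{\Gamma(-r\Delta^{-1})} + B\left( \frac{\alpha_2}{\Delta} \right)^{\nu}\frac{\Gamma(\nu + 1)}{\Gamma(\nu - r\Delta^{-1})} \right\}(1 + o(1)).$$

Since $\lim_{t \to \infty} \psi_Z(t) = 0$, we have

$$B = -\left( \frac{\Delta}{\alpha_2} \right)^{\nu}\frac{\Gamma(-\alpha_2^{-1})\,\Gamma(\nu - r\Delta^{-1})}{\Gamma(\nu + 1)\,\Gamma(-r\Delta^{-1})}. \qquad (18)$$

Similarly as $t \to -\infty$,

$$\psi_Z(t) = e^{-\bar r|t| + \Delta|t|/\alpha_2}\left( \frac{\Delta|t|}{\alpha_2} \right)^{\alpha_2^{-1} - \bar r\Delta^{-1}}\left\{ \frac{\Gamma(-\alpha_2^{-1})}{\Gamma(-\bar r\Delta^{-1})} + \tilde B\left( \frac{\alpha_2}{\Delta} \right)^{\nu}\frac{\Gamma(\nu + 1)}{\Gamma(\nu - \bar r\Delta^{-1})} \right\}(1 + o(1)).$$

Since $\lim_{t \to -\infty} \psi_Z(t) = 0$, we have

$$\tilde B = -\left( \frac{\Delta}{\alpha_2} \right)^{\nu}\frac{\Gamma(-\alpha_2^{-1})\,\Gamma(\nu - \bar r\Delta^{-1})}{\Gamma(\nu + 1)\,\Gamma(-\bar r\Delta^{-1})}. \qquad (19)$$

Theorem 2 now follows from (17), (18), (19), the definition of $U(\cdot; \cdot; \cdot)$ and Euler's
reflection formula, namely

$$\Gamma(x)\,\Gamma(1 - x) = \frac{\pi}{\sin(\pi x)}$$

[see, for example, Theorem 1.2.1 of Andrews, Askey and Roy (1999)]. □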
and goodness of fit

Abstract: We present a general proposal for testing for goodness of fit, based
on resampling and subsampling methods, and illustrate it with graphical and
analytical tests for the problems of testing for univariate or multivariate
normality. The proposal shows promising, and in some cases dramatic, success in
detecting nonnormality. Compared to common competitors, such as a Q-Q plot
or a likelihood ratio test against a specified alternative, our proposal seems to
be the most useful when the sample size is small, such as 10 or 12, or even
very small, such as 6! We also show how our proposal provides tangible
information about the nature of the true cdf from which one is sampling. Thus, our
proposal also has data analytic value. Although only the normality problem
is addressed here, the scope of application of the general proposal should be
much broader.

1. Introduction 

The purpose of this article is to present a general proposal, based on re- or
subsampling, for goodness of fit tests and apply it to the problem of testing for univariate
or multivariate normality of iid data. Based on the evidence we have accumulated,
the proposal seems to have unexpected success. It comes out especially well,
relative to its common competitors, when the sample size is small, or even very
small. The common tests, graphical or analytical, do not have much credibility for
very small sample sizes. For example, a Q-Q plot with a sample of size 6 would
be hardly credible; neither would be an analytical test, such as the Shapiro-Wilk,
the Anderson-Darling or the Kolmogorov-Smirnov test with estimated parameters
(Shapiro and Wilk (1965), Anderson and Darling (1952, 1954), Stephens (1976),
Babu and Rao (2004)). But, somewhat mysteriously, the tests based on our
proposal seem to have impressive detection power even with such small sample sizes.
Furthermore, the proposal is general, and so its scope of application is broader than
just the normality problem. However, in this article, we choose to investigate only
the normality problem in detail, it being the obvious first application one would
want to try. Although we have not conducted a complete technical analysis, we still
hope that we have presented here a useful set of ideas with broad applicability.

The basic idea is to use a suitably chosen characterization result for the null
hypothesis and combine it with the bootstrap or subsampling to produce a goodness
of fit test. The idea has been mentioned previously. But it has not been investigated
in the way or at the length that we do here (see McDonald and Katti (1974),
Mudholkar, McDermott and Srivastava (1992), Mudholkar, Marchetti and Lin (2002)
and D'Agostino and Stephens (1986)). To illustrate the basic idea, it is well known
that if $X_1, X_2, \ldots, X_n$ are iid samples from some cdf $F$ on the real line with a finite
variance, then $F$ is a normal distribution if and only if the sample mean $\bar X$ and
the sample variance $s^2$ are independent, and distributed, respectively, as a normal
and a (scaled) chisquare. Therefore, using standard notation, with $G_m$ denoting
the cdf of a chisquare distribution with $m$ degrees of freedom, the random variables
$U_n = \Phi(\sqrt{n}(\bar X - \mu)/\sigma)$ and $V_n = G_{n-1}((n - 1)s^2/\sigma^2)$ would be independent $U[0, 1]$ random
variables. Proxies of $U_n$, $V_n$ can be computed, in the usual way, by using either a
resample (such as the ordinary bootstrap), or a subsample, with some subsample
size $b$. These proxies, namely the pairs $w_i^* = (U_i^*, V_i^*)$, can then be plotted in the
unit square to visually assess evidence of any structured or patterned deviation
from a random uniform-like scattering. They can also be used to construct formal
tests, in addition to graphical tests. The use of the univariate normality problem,
and of $\bar X$ and $s^2$, are both artifacts. Other statistics can be used, and in fact we
do so (interquartile range/$s$ and $s$, for instance). We also investigate the
multivariate normality problem, which remains to date a notoriously difficult problem,
especially for small sample sizes, the case we most emphasize in this article.
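The construction of the proxies can be sketched in code. The version below is only one plausible reading of "in the usual way" and is not taken from the article: it assumes that $\mu$ and $\sigma^2$ are replaced by the sample mean and sample variance of the original data, and that $U^*$ and $V^*$ are recomputed on each ordinary bootstrap resample. The names `chi2_cdf` and `bootstrap_proxies` are hypothetical, and the chi-square cdf is evaluated with a simple series for the regularized incomplete gamma function so that only the standard library is needed.

```python
import math
import random
from statistics import NormalDist, mean, variance

def chi2_cdf(x, m):
    """CDF of a chi-square with m degrees of freedom, using
    P(a, y) = exp(-y) * y^a * sum_k y^k / Gamma(a + k + 1), a = m/2, y = x/2."""
    if x <= 0.0:
        return 0.0
    a, y = m / 2.0, x / 2.0
    total, k = 0.0, 0
    while True:
        term = math.exp(-y + (a + k) * math.log(y) - math.lgamma(a + k + 1.0))
        total += term
        if k > y and term < 1e-16:   # terms decrease once k exceeds y
            break
        k += 1
    return min(total, 1.0)

def bootstrap_proxies(data, n_boot=500, seed=0):
    """Return n_boot pairs (U*, V*); under normality they should scatter
    roughly uniformly over the unit square (assumed construction, see lead-in)."""
    rng = random.Random(seed)
    n = len(data)
    xbar, s2 = mean(data), variance(data)   # stand-ins for mu and sigma^2
    phi = NormalDist().cdf
    pairs = []
    for _ in range(n_boot):
        star = [rng.choice(data) for _ in range(n)]   # resample with replacement
        u = phi(math.sqrt(n) * (mean(star) - xbar) / math.sqrt(s2))
        v = chi2_cdf((n - 1) * variance(star) / s2, n - 1)
        pairs.append((u, v))
    return pairs
```

A scatter plot of the returned pairs is the graphical test; a subsampling variant would draw $b < n$ points without replacement instead of $n$ points with replacement.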

We begin with a quantification of the statistical folklore that Q-Q plots tend
to look linear in the central part of the plot for many types of nonnormal data.
We present these results on the Q-Q plot for two main reasons. The precise
quantifications we give would be surprising to many people; in addition, these results
provide a background for why complementary graphical tests, such as the ones we
offer, can be useful.

The resampling based graphical tests are presented and analyzed next. A charming
property of our resampling based test is that it does not stop at simply detecting
nonnormality. It gives substantially more information about the nature of the true
cdf from which one is sampling, if it is not a normal cdf. We show how a skillful
analysis of the graphical test would produce such useful information by looking at
key features of the plots, for instance, empty corners, or a pronounced trend. In
this sense, our proposal also has the flavor of being a useful data analytic tool.

Subsampling based tests are presented at the end. But we do not analyze them
in as much detail as the resampling based tests. The main reason is limitation
of space. But comparison of the resampling based tests and the tests based on
subsampling reveals quite interesting phenomena. For example, when a structured
deviation from a uniform-like scattering is seen, the structures are different for the
re- and subsampling based tests. Thus, we seem to have the situation that we do
not need to necessarily choose one or the other. The resampling and subsampling
based tests complement each other. They can both be used, as alternatives or
complements, to common tests, and especially when the sample sizes are small, or
even very small.

To summarize, the principal contributions and the salient features of this article
are the following:

1. We suggest a flexible general proposal for testing goodness of fit to parametric
families based on characterizations of the family;

2. We illustrate the method for the problems of testing univariate and
multivariate normality;

3. The method is based on re- or subsampling, and tests based on the two methods
nicely complement each other;

4. Graphical tests form the core of our proposal, and they are especially useful
for small sample sizes due to lack of credible graphical tests when the sample
size is small;

5. We give companion formal tests to our graphical tests with some power
studies; but the graphical test is more effective in our assessment;

6. We provide a theoretical background for why new graphical tests should be
welcome in the area by providing some precise quantifications for just how
misleading Q-Q plots can be. The exact results should be surprising to many.

7. We indicate scope of additional applications by discussing three interesting
problems.

2. Why Q-Q plots can mislead 

The principal contribution of our article is a proposal for new resampling based
graphical tests for goodness of fit. Since Q-Q plots are of wide and universal use
for that purpose, it would be helpful to explain why we think that alternative
graphical tests would be useful, and perhaps even needed. Towards this end, we
first provide a few technical results and some numerics to illustrate how Q-Q plots
can be misleading. It has been part of the general knowledge and folklore that Q-Q
plots can be misleading; but the results below give some precise explanation for
and quantification of such misleading behavior of Q-Q plots.

Q-Q plots can mislead for two reasons. First, they look approximately linear
in the central part for many types of nonnormal data. Second, the common
standard we apply to ourselves (and teach students) is that we should not overreact to
wiggles in the Q-Q plot and that what counts is an overall visual impression of linearity.
The following results explain why that standard is a dangerous one. First some
notation is introduced.

The exact definition of the Q-Q plot varies a little from source to source.
For the numerical illustrations, we will define a Q-Q plot as a plot of the pairs
$(z_{(i-1/2)/n}, X_{(i)})$, where $z_\alpha = \Phi^{-1}(1 - \alpha)$ is the $(1 - \alpha)$th quantile of the
standard normal distribution and $X_{(i)}$ is the ith sample order statistic (at other places,
$z_{(i-1/2)/n}$ is replaced by $z_{(i+1/2)/(n+1)}$, $z_{(i+1/2)/(n+3/4)}$, etc. Due to the asymptotic
nature of our results, these distinctions do not affect the statements of the results).
For notational simplicity, we will simply write $z_i$ for $z_{(i-1/2)/n}$. The natural index
for visual linearity of the Q-Q plot is the coefficient of correlation, denoted $r_n$,
between the $z_i$ and the $X_{(i)}$.

As we mentioned above, the central part of a Q-Q plot tends to look
approximately linear for many types of nonnormal data. This necessitates another index
for linearity of the central part in a Q-Q plot. Thus, for $0 < \alpha < 0.5$, we define the
trimmed correlation $r_\alpha$ as the same correlation coefficient computed from the pairs
$(z_i, X_{(i)})$, $i = k + 1, \ldots, n - k$,
where $k = [n\alpha]$, and $\bar X_k$ is the corresponding trimmed mean. In other words, $r_\alpha$ is
the correlation in the Q-Q plot when $100\alpha\%$ of the points are deleted from each
tail of the plot. $r_\alpha$ typically is larger in magnitude than $r_n$, as we shall see below.

We will assume that the true underlying CDF $F$ is continuous, although a
number of our results do not require that assumption.

2.1. Almost sure limits of $r_n$ and $r_\alpha$

Theorem 1. Let $X_1, X_2, \ldots, X_n$ be iid observations from a CDF $F$ with finite
variance $\sigma^2$. Then

$$r_n \to \rho(F) = \frac{\int_0^1 F^{-1}(x)\,\Phi^{-1}(x)\,dx}{\sigma}$$

with probability 1.

Proof. Divide the numerator, as well as each term within the square-root sign in
the denominator, by $n$. The term $\frac{1}{n}\sum_i z_i^2$ converges to $\int_0^1 (\Phi^{-1}(x))^2\,dx$, being a
Riemann sum for that integral. The second term $\frac{1}{n}\sum_i (X_{(i)} - \bar X)^2$ converges a.s.
to $\sigma^2$ by the usual strong law. Since $\int_0^1 (\Phi^{-1}(x))^2\,dx = 1$, on division by $n$, the
denominator in $r_n$ converges a.s. to $\sigma$.

The numerator needs a little work. Using the same notation as in Serfling (1980)
(pp. 277-279), define the double sequence $t_{ni} = (i - 1/2)/n$ and $J(t) = \Phi^{-1}(t)$.
Thus $J$ is everywhere continuous and satisfies, for every $r > 0$ and in particular for
$r = 2$, the growth condition $|J(t)| \leq M[t(1 - t)]^{1/r - 1 + \delta}$ for some $\delta > 0$. Trivially,
$\max_{1 \leq i \leq n} |t_{ni} - i/n| \to 0$. Finally, there exists a positive constant $a$ such that
$a \cdot \min_{1 \leq i \leq n}\{i/n, 1 - i/n\} \leq t_{ni} \leq 1 - a \cdot \min_{1 \leq i \leq n}\{i/n, 1 - i/n\}$. Specifically, this
holds with $a = 1/2$. It follows from Example A and Example A' on pp. 277-279
of Serfling (1980) that on division by $n$, the numerator of $r_n$ converges a.s. to
$\int_0^1 F^{-1}(x)\,\Phi^{-1}(x)\,dx$, establishing the statement of Theorem 1. □

The almost sure limit of the truncated correlation $r_\alpha$ is stated next; we omit its
proof as it is very similar to the proof of Theorem 1.


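Theorem 2. Let $X_1, X_2, \ldots, X_n$ be iid observations from a CDF $F$. Let $0 < \alpha < 0.5$, and

$$ \mu_\alpha = \frac{\int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} x\,dF(x)}{1 - 2\alpha}. $$

Then, with probability 1,

$$ r_\alpha \to \rho_\alpha(F) = \frac{\int_\alpha^{1-\alpha} F^{-1}(x)\,\Phi^{-1}(x)\,dx}{\sqrt{\int_\alpha^{1-\alpha} (\Phi^{-1}(x))^2\,dx \cdot \int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} (x - \mu_\alpha)^2\,dF(x)}}. $$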


Theorems 1 and 2 are used in the following table to explain why Q-Q plots show
an overall visual linearity for many types of nonnormal data, and especially so in
the central part of the plot.


Table 1: Limiting correlation in Q-Q plots.

F                                                      No trimming   5% trimming
Uniform                                                .9772         .9949
Double Exp.                                            .9811         .9941
Logistic                                               .9663         .9995
t(3)                                                   .9008         .9984
t(5)                                                   .9832         .9991
Tukey distribution (defined as .9N(0,1) + .1N(0,9))    .9706         .9997
chisquare(5)                                           .9577         .9826
Exponential                                            .9032         .9536


Discussion of Table 1

We see from Table 1 that for each distribution that we tried, the trimmed correlation
is larger than the untrimmed one. We also see that as little as 5% trimming
from each tail produces a correlation at least as large as .95, even for the
extremely skewed Exponential case. For symmetric populations, 5% trimming produces
a nearly perfectly linear Q-Q plot, asymptotically. Theorem 1, Theorem 2,
and Table 1 vindicate our common empirical experience that the central part of
a Q-Q plot is very likely to look linear for all types of data: light tailed, medium
tailed, heavy tailed, symmetric, skewed. Information about nonnormality from a
Q-Q plot can only come from the tails, and the somewhat pervasive practice of
concentrating on the overall linearity and ignoring the wiggles at the tails renders
the Q-Q plot substantially useless in detecting nonnormality. Certainly we are not
suggesting, and it is not true, that everyone uses the Q-Q plot by concentrating
on the central part. Still, these results suggest that alternative or complementary
graphical tests can be useful, especially for small sample sizes. A part of our efforts
in the rest of this article addresses that.
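As an illustration of how the entries of Table 1 can be approximated, the limits in Theorems 1 and 2 can be evaluated by a midpoint Riemann sum over $(\alpha, 1-\alpha)$. The sketch below is not the computation used for the table; the function name `limiting_correlation`, the grid size, and the use of NumPy and SciPy are assumptions made for illustration. The values it produces should come out close to the Uniform and Exponential rows.

```python
import numpy as np
from scipy.special import ndtri  # standard normal quantile function Phi^{-1}


def limiting_correlation(quantile, alpha=0.0, n_grid=400_000):
    """Midpoint Riemann-sum approximation to rho_alpha(F); alpha=0 gives rho(F).

    `quantile` is F^{-1}, vectorized over arrays of values in (0, 1).
    (Hypothetical helper, introduced only for this illustration.)
    """
    width = 1.0 - 2.0 * alpha
    # Midpoints of n_grid equal cells covering (alpha, 1 - alpha).
    u = alpha + width * (np.arange(n_grid) + 0.5) / n_grid
    du = width / n_grid
    q = quantile(u)
    z = ndtri(u)
    mu = np.sum(q) * du / width  # trimmed mean mu_alpha (plain mean if alpha=0)
    numerator = np.sum(q * z) * du
    denominator = np.sqrt(np.sum(z ** 2) * du * np.sum((q - mu) ** 2) * du)
    return numerator / denominator


def uniform_quantile(u):
    return u


def exponential_quantile(u):
    return -np.log1p(-u)


rho_uniform = limiting_correlation(uniform_quantile)
rho_uniform_trimmed = limiting_correlation(uniform_quantile, alpha=0.05)
rho_exponential = limiting_correlation(exponential_quantile)
```

Other rows of the table can be approximated the same way by passing the appropriate quantile function.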


3. Resampling based tests for univariate normality 
3.1. Test based on $\bar{X}$ and $s^2$


Let $X_1, X_2, \ldots, X_n$ be iid observations from a $N(\mu, \sigma^2)$ distribution. A well known
characterization of the family of normal distributions is that the sample mean $\bar{X}$
and the sample variance $s^2$ are independently distributed (see Kagan, Linnik and
Rao (1973); a good generalization is Parthasarathy (1976). The generalizations due
to him can be used for other resampling based tests of normality). If one can test
their independence using the sample data, it would in principle provide a means
of testing for the normality of the underlying population. But of course to test the
independence, we will have to have some idea of the joint distribution of $\bar{X}$ and $s^2$,
and this cannot be done using just one set of sample observations in the standard
statistical paradigm. Here is where resampling can be useful.

Thus, for some $B > 1$, let $X^*_{i1}, X^*_{i2}, \ldots, X^*_{in}$, $i = 1, 2, \ldots, B$, be a sample from
the empirical CDF of the original sample values $X_1, X_2, \ldots, X_n$. Define

$$ \bar{X}^*_i = \frac{1}{n} \sum_{j=1}^{n} X^*_{ij} \quad \text{and} \quad s^{*2}_i = \frac{1}{n-1} \sum_{j=1}^{n} (X^*_{ij} - \bar{X}^*_i)^2. $$

Let $\Phi$ denote the standard normal CDF and $G_m$ the CDF of the chisquare
distribution with $m$ degrees of freedom. Under the null hypothesis of normality, the
statistics

$$ U_n = \Phi\left(\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}\right) \quad \text{and} \quad V_n = G_{n-1}\left(\frac{(n-1)s^2}{\sigma^2}\right) $$

are independently distributed as $U[0, 1]$.

Motivated by this, define: for $i = 1, 2, \ldots, B$,

$$ u^*_i = \Phi\left(\frac{\sqrt{n}(\bar{X}^*_i - \bar{X})}{s}\right) \quad \text{and} \quad v^*_i = G_{n-1}\left(\frac{(n-1)s^{*2}_i}{s^2}\right). $$

Let $w^*_i = (u^*_i, v^*_i)$, $i = 1, 2, \ldots, B$. If the null hypothesis is true, the $w^*_i$ should
be roughly uniformly scattered in the unit square $[0, 1] \times [0, 1]$. This is the graphical
test we propose in this section. A subsampling based test using the same idea will
be described in a subsequent section. We will present evidence that this resampling
based graphical test is quite effective, and relatively speaking, is more useful for
small sample sizes. This is because for small $n$, it is hard to think of other procedures
that will have much credibility. For example, if $n = 6$, a case that we present here,
it is not very credible to draw a Q-Q plot. Our resampling based test would be
more credible for such small sample sizes.
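The construction of the bootstrapped pairs can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the plots in this article: the function name `bootstrap_pairs`, the default number of resamples, and the reliance on NumPy and SciPy are all choices made here.

```python
import numpy as np
from scipy.stats import norm, chi2


def bootstrap_pairs(x, n_boot=200, rng=None):
    """Return (u*, v*) for n_boot resamples of x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    rng = np.random.default_rng(rng)
    xbar, s = x.mean(), x.std(ddof=1)
    # Each row is one resample of size n drawn with replacement
    # from the empirical CDF of the original data.
    resamples = rng.choice(x, size=(n_boot, n), replace=True)
    xbar_star = resamples.mean(axis=1)
    s2_star = resamples.var(axis=1, ddof=1)
    # Quantile transformations, with (xbar, s) standing in for (mu, sigma).
    u_star = norm.cdf(np.sqrt(n) * (xbar_star - xbar) / s)
    v_star = chi2.cdf((n - 1) * s2_star / s ** 2, df=n - 1)
    return u_star, v_star
```

A scatterplot of `v_star` against `u_star` on the unit square is then the graphical test; for example, `bootstrap_pairs(np.random.default_rng(0).exponential(size=6))` gives the pairs for a simulated Exponential sample of size 6.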

The following consistency theorem shows that our method will correctly identify
the joint distribution of $(U_n, V_n)$ asymptotically. Although we use the test in
small samples, the consistency theorem still provides some necessary theoretical
foundation for our method.

Theorem 3. Using standard notation,

$$ \sup_{0 \le u \le 1,\ 0 \le v \le 1} \left| P_*(U^*_n \le u, V^*_n \le v) - P_F(U_n \le u, V_n \le v) \right| \to 0 $$

in probability, provided $F$ has four moments, where $F$ denotes the true CDF from
which $X_1, X_2, \ldots, X_n$ are iid observations.


Proof. We observe that the ordinary bootstrap is consistent for the joint distribution
of $(\bar{X}, s^2)$ if $F$ has four moments. Theorem 3 follows from this and the uniform
delta theorem for the bootstrap (see van der Vaart (1998)). □


Under the null hypothesis, $(U_n, V_n)$ are uniformly distributed in the unit square
for each $n$, and hence also asymptotically. We next describe the joint asymptotic
distribution of $(U_n, V_n)$ under a general $F$ with four moments. It will follow that
our test is not consistent against a specific alternative $F$ if and only if $F$ has the
same first four moments as some $N(\mu, \sigma^2)$ distribution. From the point of view of
common statistical practice, this is not a major drawback. To have a test consistent
against all alternatives, we will have to use more than $\bar{X}$ and $s^2$.

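Theorem 4. Let $X_1, X_2, \ldots, X_n$ be iid observations from a CDF $F$ with four
finite moments. Let $\mu_3, \mu_4$ denote the third and the fourth central moment of $F$,
and $\kappa = \mu_4/\sigma^4$. Then,

$(U_n, V_n) \Rightarrow H$, where $H$ has the density

$$ h(u, v) = \sqrt{\frac{2\sigma^6}{(\kappa - 1)\sigma^6 - \mu_3^2}} \exp\left\{ \frac{1}{2\left((\kappa - 1)\sigma^6 - \mu_3^2\right)} \times \left[ 2\sqrt{2}\,\mu_3 \sigma^3\, \Phi^{-1}(u)\Phi^{-1}(v) + (\kappa - 3)\sigma^6 (\Phi^{-1}(v))^2 - \mu_3^2\left( (\Phi^{-1}(u))^2 + (\Phi^{-1}(v))^2 \right) \right] \right\}. \quad (1) $$


Proof. Let

$$ Z_{1n} = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}, \qquad Z_{2n} = \frac{\sqrt{n}(s^2 - \sigma^2)}{\sqrt{\mu_4 - \sigma^4}}. $$

Then, it is well known that $(Z_{1n}, Z_{2n}) \Rightarrow (Z_1, Z_2) \sim N(0, 0, \Sigma)$, where $\Sigma = ((\sigma_{ij}))$,
with $\sigma_{11} = 1$, $\sigma_{12} = \frac{\mu_3}{\sigma\sqrt{\mu_4 - \sigma^4}}$ and $\sigma_{22} = 1$.
Hence, from the definitions of $U_n, V_n$, it follows that we only need the joint
asymptotic distribution of $\left(\Phi(Z_{1n}), \Phi\left(\sqrt{\tfrac{\kappa - 1}{2}}\, Z_{2n}\right)\right)$. By the continuity theorem for
weak convergence, therefore, $(U_n, V_n) \Rightarrow \left(\Phi(Z_1), \Phi\left(\sqrt{\tfrac{\kappa - 1}{2}}\, Z_2\right)\right)$. Thus, we need to
derive the joint density of $\left(\Phi(Z_1), \Phi\left(\sqrt{\tfrac{\kappa - 1}{2}}\, Z_2\right)\right)$, which will be our $h(u, v)$.

Let $f(x, y)$ denote the bivariate normal density of $(Z_1, Z_2)$, i.e., let

$$ f(x, y) = \frac{1}{2\pi\sqrt{1 - \sigma_{12}^2}} \exp\left\{ -\frac{x^2 - 2\sigma_{12} xy + y^2}{2(1 - \sigma_{12}^2)} \right\}. $$

Then,

$$ H(u, v) = P\left( \Phi(Z_1) \le u,\ \Phi\left(\sqrt{\tfrac{\kappa - 1}{2}}\, Z_2\right) \le v \right) = P\left( Z_1 \le \Phi^{-1}(u),\ Z_2 \le \sqrt{\tfrac{2}{\kappa - 1}}\, \Phi^{-1}(v) \right) = \int_{-\infty}^{\Phi^{-1}(u)} \int_{-\infty}^{\sqrt{2/(\kappa - 1)}\,\Phi^{-1}(v)} f(x, y)\,dy\,dx. $$

The joint density $h(u, v)$ is obtained by obtaining the mixed partial derivative
$\frac{\partial^2}{\partial u\,\partial v} H(u, v)$. Direct differentiation using the chain rule gives

$$ h(u, v) = \sqrt{\frac{2}{\kappa - 1}}\; \frac{f\left(\Phi^{-1}(u),\ \sqrt{\tfrac{2}{\kappa - 1}}\,\Phi^{-1}(v)\right)}{\phi(\Phi^{-1}(u))\,\phi(\Phi^{-1}(v))}, $$

on some algebra, where $\phi$ denotes the standard normal density.

From here, the stated formula for $h(u, v)$ follows on some further algebra, which
we omit. □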
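For readers who want to evaluate the asymptotic density numerically, formula (1) translates directly into code. The following is a sketch under assumptions: the function name `asymptotic_density` and its argument names are introduced here, and the moments of $F$ must be supplied by the user.

```python
import numpy as np
from scipy.special import ndtri  # standard normal quantile function Phi^{-1}


def asymptotic_density(u, v, sigma, mu3, mu4):
    """Evaluate h(u, v) of formula (1) for given central moments of F.

    Hypothetical helper; requires (kappa - 1) sigma^6 - mu3^2 > 0.
    """
    kappa = mu4 / sigma ** 4
    d = (kappa - 1.0) * sigma ** 6 - mu3 ** 2
    a, b = ndtri(u), ndtri(v)
    bracket = (2.0 * np.sqrt(2.0) * mu3 * sigma ** 3 * a * b
               + (kappa - 3.0) * sigma ** 6 * b ** 2
               - mu3 ** 2 * (a ** 2 + b ** 2))
    return np.sqrt(2.0 * sigma ** 6 / d) * np.exp(bracket / (2.0 * d))
```

With normal moments (`mu3 = 0`, `mu4 = 3 * sigma**4`) the function returns 1 everywhere, in agreement with the uniformity of $(U_n, V_n)$ under the null hypothesis. For the standard Exponential one would pass `sigma=1, mu3=2, mu4=9`.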


3.2. Learning from the plots 

It is clear from the expression for $h(u, v)$ that if the third central moment $\mu_3$ is zero,
then $U, V$ are independent; moreover, $U$ is marginally uniform. Thus, intuitively,
we may expect that our proposal would have less success for distinguishing normal
data from other symmetric data, and more success in detecting nonnormality when
the population is skewed. This is in fact true, as we shall later see in our simulations
of the test. It would be useful to see the plots of the density $h(u, v)$ for some trial
nonnormal distributions, and try to synchronize them with actual simulations of
the bootstrapped pairs $w^*_i$. Such a synchronization would help us learn something
about the nature of the true population as opposed to just concluding nonnormality.
In this, we have had reasonable success, as we shall again see in our simulations.
We remark that this is one reason that knowing the formula in Theorem 4 for the
asymptotic density $h(u, v)$ is useful; other uses of knowing the asymptotic density
are discussed below.

It is informative to look at a few other summary quantities of the asymptotic
density $h(u, v)$ that we can try to synchronize with our plots of the $w^*_i$. We have
in mind summaries that would indicate if we are likely to see an upward or downward
trend in the plot under a given specific $F$, and if we might expect noticeable
departures from a uniform scattering such as empty corners. The next two results
shed some light on those questions.

Theorem 5. Let $(U, V) \sim h(u, v)$. Then, $\rho := \mathrm{Corr}(U, V)$ has the following values
for the corresponding choices of $F$:

$\rho \approx .69$ if $F$ = Exponential;
$\rho \approx .56$ if $F$ = Chisquare(5);
$\rho \approx .44$ if $F$ = Beta(2, 6);
$\rho \approx .50$ if $F$ = Beta(2, 10);
$\rho \approx .53$ if $F$ = Poisson(1);
$\rho \approx .28$ if $F$ = Poisson(5).

The values of $\rho$ stated above follow by using the formula for $h(u, v)$ and doing
the requisite expectation calculations by a two dimensional numerical integration.

A discussion of the utility of knowing the asymptotic correlations will follow the
next theorem.

Theorem 6. Let $p_{11} = P(U \le .2, V \le .2)$, $p_{12} = P(U \le .2, V \ge .8)$, $p_{13} = P(U \ge .8, V \le .2)$
and $p_{14} = P(U \ge .8, V \ge .8)$.

Then, $p_{11} = p_{12} = p_{13} = p_{14} = .04$ if $F$ = Normal;

$p_{11} = .024$, $p_{12} = .064$, $p_{13} = .0255$, $p_{14} = .068$ if $F$ = Double Exponential;

$p_{11} = .023$, $p_{12} = .067$, $p_{13} = .024$, $p_{14} = .071$ if $F$ = t(5);

$p_{11} = .01$, $p_{12} = .02$, $p_{13} = .01$, $p_{14} = .02$ if $F$ = Uniform;

$p_{11} = .04$, $p_{12} = .008$, $p_{13} = .004$, $p_{14} = .148$ if $F$ = Exponential;

$p_{11} = .04$, $p_{12} = .012$, $p_{13} = .006$, $p_{14} = .097$ if $F$ = Beta(2, 6);

$p_{11} = .045$, $p_{12} = .01$, $p_{13} = .005$, $p_{14} = .117$ if $F$ = Beta(2, 10).

Proof. Again, the values stated in the Theorem are obtained by using the formula
for $h(u, v)$ and doing the required numerical integrations. □

3.3. Synchronization of theorems and plots 

Together, Theorem 5 and Theorem 6 have the potential of giving useful information
about the nature of the true CDF $F$ from which one is sampling, by inspecting the
cloud of the $w^*_i$ and comparing certain features of the cloud with the general pattern
of the numbers quoted in Theorems 5 and 6. Here are some main points.

1. A pronounced upward trend in the cloud would indicate a right skewed
population (such as Exponential or a small degree of freedom chisquare or a right
skewed Beta, etc.), while a mild upward trend may be indicative of a population
slightly right skewed, such as a Poisson with a moderately large mean.

2. To make a finer distinction, Theorem 6 can be useful. $p_{11}, p_{12}, p_{13}, p_{14}$ respectively
measure the density of the points in the lower left, upper left, lower right,
and the upper right corner of the cloud. From Theorem 6 we learn that for right
skewed populations, the upper left and the lower right corners should be rather
empty, while the upper right corner should be relatively much more crowded. This
is rather interesting, and consistent with the correlation information provided by
Theorem 5 too.

3. In contrast, for symmetric heavy tailed populations, the two upper corners
should be relatively more crowded compared to the two lower corners, as we can
see from the numbers obtained in Theorem 6 for Double Exponential and t(5)
distributions. For uniform data, all four corners should be about equally dense, with
a general sparsity of points in all four corners. In our opinion, these conclusions that
one can draw from Theorems 5 and 6 together about the nature of the true CDF
are potentially quite useful.

We next present a selection of scatterplots corresponding to our test above. Due
to reasons of space, we are unable to present all the plots we have. The plots we
present characterize what we saw in our plots typically; the resample size $B$ varies
between 100 and 200 in the plots. The main conclusions we draw from our plots
are summarized in the following discussion.

[Figure: Bootstrap Test for Normality Using N(0,1) Data; n = 6]
[Figure: Bootstrap Test for Normality Using Exp(1) Data; n = 6]
[Figure: Bootstrap Test for Normality Using N(0,1) Data; n = 25]
[Figure: Bootstrap Test for Normality Using U[0,1] Data; n = 25]
[Figure: Bootstrap Test for Normality Using t(4) Data; n = 25]
[Figure: Bootstrap Test for Normality Using Exp(1) Data; n = 25]

The most dramatic aspect of these plots is the transparent structure in the
plots for the right skewed Exponential case for the extremely small sample size of
$n = 6$. We also see satisfactory agreement as regards the density of points at the
corners with the statements in Theorem 6. Note the relatively empty upper left and
lower right corners in the Exponential plot, as Theorem 6 predicts, and the general
sparsity of points in all the corners in the uniform case, also as Theorem 6 predicts.
The plot for the t case shows mixed success; the very empty upper left corner is not
predicted by Theorem 6. However, the plot itself looks very nonuniform in the unit
square, and in that sense the t(4) plot can be regarded as a success. To summarize,
certain predictions of Theorems 5 and 6 manifest reasonably in these plots, which
is reassuring.

The three dimensional plots of the asymptotic density function $h(u, v)$ are also
presented next for the uniform, t(5), and the Exponential case, for completeness
and better understanding.

3.4. Comparative power and a formal test

While graphical tests have a simple appeal and are preferred by some, a formal test 
is more objective. We will offer some in this subsection; however, for the kinds of 
small sample sizes we are emphasizing, the chi-square approximation is not good. 
The correct percentiles needed for an accurate application of the formal test would 
require numerical evaluation. In the power table reported below, that was done. 

The formal test 

The test is a standard chi-square test. Partition the unit square into subrectangles
$[a_i, b_i] \times [a_j, b_j]$, $i, j = 1, \ldots, 5$, where $a_i = .2(i - 1)$ and $b_i = .2i$, and let, in a
collection of $B$ points, $O_{ij}$ be the observed number of pairs $w^*$ in the subrectangle
$[a_i, b_i] \times [a_j, b_j]$. The expected number of points in each subrectangle is $.04B$.
Thus, the test is as follows:

Calculate $\chi^2 = \sum_{i,j} \frac{(O_{ij} - .04B)^2}{.04B}$ and find the P-value $P(\chi^2(24) > \chi^2)$.
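The computation of the statistic and its nominal P-value can be sketched as follows. The function name `grid_chisquare_test` and the use of NumPy's `histogram2d` are assumptions made for illustration. The P-value returned is the nominal $\chi^2(24)$ one, which, as noted above, is not accurate for the small sample sizes emphasized here; correct percentiles would have to be evaluated numerically.

```python
import numpy as np
from scipy.stats import chi2


def grid_chisquare_test(u, v, cells_per_side=5):
    """Chi-square statistic for uniformity of the pairs (u, v) on a square grid.

    Hypothetical helper; with cells_per_side=5 each cell has expected
    count .04 B and the reference distribution has 24 degrees of freedom.
    """
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    edges = np.linspace(0.0, 1.0, cells_per_side + 1)
    observed, _, _ = np.histogram2d(u, v, bins=[edges, edges])
    expected = u.size / cells_per_side ** 2
    statistic = np.sum((observed - expected) ** 2 / expected)
    p_value = chi2.sf(statistic, df=cells_per_side ** 2 - 1)
    return statistic, p_value
```

The inputs would typically be the bootstrapped pairs $(u^*_i, v^*_i)$ of Subsection 3.1.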

How does the test perform? One way to address the issue is to see whether a test
statistic based on the plot has reasonable power. It is clear that the plot-based tests
cannot be more powerful than the best test (for a given alternative), but maybe
they can be competitive.

We take the best test to be the likelihood ratio test for testing the alternative
versus the normal, using the location-scale family for each distribution. The plot-based
tests include the test in the paper, two based on the MAD($v^*_i$) (median
absolute deviation of the $v^*_i$'s), one which rejects for large values and one for small
values, and two based on Correlation($u^*_i, v^*_i$). Note the likelihood ratio test can
only be used when there is a specified alternative, but the plot-based tests are
omnibus. Thus, what counts is whether the plot-based tests show some all round
good performance.

The tables below have the estimated powers (for $\alpha = 0.05$) for various alternatives,
for $n = 6$ and 25.


[Figure: Plot of Theoretical Asymptotic Density h(x,y) in U[-1,1] Case]
[Figure: Plot of Theoretical Asymptotic Density h(x,y) in t(5) Case]
[Figure: Plot of Theoretical Asymptotic Density h(x,y) in Exp(1) Case]


n = 6

               chi-sq   MAD(>)   MAD(<)   Corr(>)   Corr(<)   LRT
Normal         0.050    0.050    0.050    0.050     0.050     0.050
Exponential    0.176    0.075    0.064    0.293     0.006     0.344
Uniform        0.048    0.033    0.105    0.041     0.044     0.118
t              0.185    0.079    0.036    0.146     0.138     0.197
t              0.070    0.059    0.043    0.064     0.067     0.089


n = 25

               chi-sq   MAD(>)   MAD(<)   Corr(>)   Corr(<)   LRT
Normal         0.050    0.050    0.050    0.050     0.050     0.050
Exponential    0.821    0.469    0.022    0.930     0.000     0.989
Uniform        0.164    0.000    0.506    0.045     0.038     0.690
t              0.553    0.635    0.003    0.261     0.264     0.721
t              0.179    0.208    0.011    0.104     0.121     0.289


The powers for $n = 6$ are naturally fairly low, but we can see that for each
distribution, there is a plot-based test that comes reasonably close to the LRT. For
the Exponential, the correlation (>) test does very well. For the uniform, the best
test rejects for small values of MAD. For the t's, rejecting for large values of MAD
works reasonably well, and the $\chi^2$ and two correlation tests do fine. These results
are consistent with the plots in the paper, i.e., for skewed distributions there is a
positive correlation between the $u^*_i$'s and $v^*_i$'s, and for symmetric distributions, the
differences are revealed in the spread of the $v^*_i$'s. On balance, the Corr(>) test for
suspected right skewed cases and the $\chi^2$ test for heavy-tailed symmetric cases seem
to be good plot-based formal tests. However, further numerical power studies will
be necessary to confirm these recommendations.

3.5. Another pair of statistics 

One of the strengths of our approach is that the pair of statistics that can be used
to define $U_n, V_n$ is flexible, and therefore different tests can be used to test for
normality. We now describe an alternative test based on another pair of statistics.
It too shows impressive power in our simulations in detecting right skewed data for
quite small sample sizes.

Let $X_1, X_2, \ldots, X_n$ be the sample values and let $Q, s$ denote respectively the
interquartile range and the standard deviation of the data. From Basu's theorem
(Basu (1955)), $Q/s$ and $s$ are independent if $X_1, X_2, \ldots, X_n$ are samples from any
normal distribution. The exact distribution of $Q/s$ in finite samples is cumbersome.
So in forming the quantile transformations, we use the asymptotic distribution of
$Q/s$. This is, admittedly, a compromise. But at the end, the test we propose still
works very well at least for right skewed alternatives. So the compromise is not a
serious drawback at least in some applications, and one has no good alternative to
using the asymptotic distribution of $Q/s$. The asymptotic distribution of $Q/s$ for any
population $F$ with four moments is explicitly worked out in DasGupta and Haff
(2003). In particular, they give the following results for the normal, Exponential
and the Beta(2,10) case, the three cases we present here as illustration of the power
of this test.

(a) $\sqrt{n}(Q/s - 1.349) \Rightarrow N(0, 1.566)$ if $F$ = normal;

(b) $\sqrt{n}(Q/s - 1.099) \Rightarrow N(0, 3.060)$ if $F$ = Exponential;

(c) $\sqrt{n}(Q/s - 1.345) \Rightarrow N(0, 1.933)$ if $F$ = Beta(2,10).

Hence, as in Subsection 3.1, define:

$$ u^*_i = \Phi\left(\frac{\sqrt{n}}{\tau}\left(\frac{Q^*_i}{s^*_i} - \frac{Q}{s}\right)\right), \quad v^*_i = G_{n-1}\left(\frac{(n-1)s^{*2}_i}{s^2}\right), \quad w^*_i = (u^*_i, v^*_i); $$

note that $\tau^2$ is the appropriate variance of the limiting normal distribution of $Q/s$, as we
indicate above. As in Subsection 3.1, we then plot the pairs $w^*_i$ and check for an
approximately uniform scattering, particularly lack of any striking structure.
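A sketch of the corresponding computation follows. The name `iqr_s_pairs` is introduced here, and the sample quartiles are taken from NumPy's default `percentile` interpolation, which is an assumption; other quartile conventions would change the values slightly in small samples. The default `tau2=1.566` is the limiting variance under normality quoted in (a).

```python
import numpy as np
from scipy.stats import norm, chi2


def iqr_s_pairs(x, n_boot=200, tau2=1.566, rng=None):
    """Bootstrapped pairs (u*, v*) based on Q/s and s (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    rng = np.random.default_rng(rng)

    def iqr(a, axis=None):
        q75, q25 = np.percentile(a, [75, 25], axis=axis)
        return q75 - q25

    s = x.std(ddof=1)
    ratio = iqr(x) / s
    resamples = rng.choice(x, size=(n_boot, n), replace=True)
    s_star = resamples.std(axis=1, ddof=1)
    ratio_star = iqr(resamples, axis=1) / s_star
    # Normal-theory scaling for Q/s; chi-square transformation for s^2.
    u_star = norm.cdf(np.sqrt(n / tau2) * (ratio_star - ratio))
    v_star = chi2.cdf((n - 1) * s_star ** 2 / s ** 2, df=n - 1)
    return u_star, v_star
```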

The plots below are for the normal, Exponential and Beta(2,10) case; the last
two were chosen because we are particularly interested in establishing the efficacy
of our procedures for picking up skewed alternatives. It is clear from the plots that
for the skewed cases, even at a small sample size $n = 12$, they show striking visual
structure, far removed from an approximately uniform scattering. In contrast, the
plot for the normal data looks much more uniform.

Exactly as in Subsection 3.1, there are analogs of Theorem 3 and Theorem 4 
for this case too; however, we will not present them. 

[Figure: Test for Univariate Normality Using IQR and s; Data = N(0,1), n = 12]
[Figure: Test for Univariate Normality Using IQR and s; Data = Exp(1), n = 12]
[Figure: Test for Univariate Normality Using IQR and s; Data = Beta(2,10), n = 12]

We now address the multivariate case briefly.

4. Resampling based tests for multivariate normality

As in the univariate case, our proposed test uses the independence of the sample
mean vector and the sample variance-covariance matrix. A difficult issue is the
selection of two statistics, one a function of the mean vector and the other a function
of the covariance matrix, that are to be used, as in the univariate case, for obtaining
the $w^*_i$ via use of the quantile transformation. We use the statistics $c'\bar{X}$, and either
$\mathrm{tr}(\Sigma^{-1}S)$, or $|S|/|\Sigma|$. Our choice is exclusively guided by the fact that for these cases,
the distributions of the statistics in finite samples are known. Other choices can (and
should) be explored, but the technicalities would be substantially more complex.

Test 1. Suppose $X_1, X_2, \ldots, X_n$ are iid $p$-variate multivariate normal observations,
distributed as $N_p(\mu, \Sigma)$. Then, for a given vector $c$, $c'\bar{X} \sim N(c'\mu, \frac{1}{n} c'\Sigma c)$, and
$\mathrm{tr}(\Sigma^{-1}S) \sim$ chisquare$(p(n - 1))$. Thus, using the same notation as in Section 3.1,

$$ U_n = \Phi\left(\frac{\sqrt{n}(c'\bar{X} - c'\mu)}{\sqrt{c'\Sigma c}}\right) \quad \text{and} \quad V_n = G_{p(n-1)}\left(\mathrm{tr}(\Sigma^{-1}S)\right) $$

are independently $U[0, 1]$ distributed. For $i = 1, 2, \ldots, B$, define

$$ u^*_i = \Phi\left(\frac{\sqrt{n}(c'\bar{X}^*_i - c'\bar{X})}{\sqrt{c'Sc}}\right) \quad \text{and} \quad v^*_i = G_{p(n-1)}\left(\mathrm{tr}(S^{-1}S^*_i)\right), $$

where $\bar{X}^*_i, S^*_i$ are the mean vector and the covariance matrix of the $i$th bootstrap
sample, and $\bar{X}, S$ are the mean vector and the covariance matrix for the original
data. As before, we plot the pairs $w^*_i = (u^*_i, v^*_i)$, $i = 1, 2, \ldots, B$ and check for an
approximately uniform scattering.

Test 2. Instead of $\mathrm{tr}(\Sigma^{-1}S)$, consider the statistic $|S|/|\Sigma| \sim \prod_{i=1}^{p} \chi^2(n - i)$, where the
chisquare variables are independently distributed.

For the special case $p = 2$, the distribution can be reduced to that of $(\chi^2(2n - 4))^2/4$
(see Anderson (1984)). Hence, $U_n$ (as defined in Test 1 above), and

$$ V_n = G_{2n-4}\left(2\sqrt{|S|/|\Sigma|}\right) $$

are independently $U[0, 1]$ distributed. Define now $u^*_i$ as in Test 1 above, but

$$ v^*_i = G_{2n-4}\left(2\sqrt{|S^*_i|/|S|}\right), $$

and plot the pairs $w^*_i = (u^*_i, v^*_i)$ to check for an approximately uniform scattering.

The CDF of $|S|/|\Sigma|$ can be written in a reasonably amenable form also for the case
$p = 3$ by using the Hypergeometric functions, but we will not describe the three
dimensional case here.

As in the univariate case, we will see that Tests 1 and 2 can be quite effective
and especially for small samples they are relatively more useful than alternative
tests used in the literature. For example, the common graphical test for bivariate
normality that plots the Mahalanobis $D^2$ values against chisquare percentiles (see
Johnson and Wichern (1992)) would not have very much credibility at sample sizes
such as $n = 10$ (a sample size we will try).

Corresponding to Theorem 3, we have a similar consistency theorem.

Theorem 7. $\sup_{0 \le u \le 1,\ 0 \le v \le 1} |P_*(U^*_n \le u, V^*_n \le v) - P_F(U_n \le u, V_n \le v)| \to 0$ in
probability, provided the true CDF $F$ has four moments (in the usual sense for a
multivariate CDF).

The nonnull asymptotics (i.e., the analog of Theorem 4) are much harder to
write down analytically. We have a notationally messy version for the bivariate
case. However, we will not present it due to the notational complexity.

The plots of the pairs $w^*_i$ corresponding to both Test 1 and Test 2 are important
to examine from the point of view of applications. The plots corresponding to the
first test are presented next. The plots corresponding to the second test look very
similar and are omitted here.

The plots again show the impressive power of the tests to detect skewness,
as is clear from the Bivariate Gamma plot (we adopt the definition of Bivariate
Gamma as $(X, Y) = (U + W, V + W)$, where $U, V, W$ are independent Gammas
with the same scale parameter; see Li (2003) for certain recent applications of
such representations). The normal plot looks reasonably devoid of any structure or
drastic nonuniformity. Considering that testing for bivariate normality continues
to remain a very hard problem for such small sample sizes, our proposals appear
to show good potential for being useful and definitely competitive. The ideas we
present need to be examined in more detail, however.

[Figure: Bivariate Normality Test using n = 10, c = (1,1), and tr(Σ^(-1)S); data = BVN(0, Σ)]
[Figure: Bivariate normality test with c = (1,1) and tr(Σ^(-1)S); data = BVGamma]

5. Subsampling based tests

An alternative to the resampling based tests of the preceding sections is to use
subsampling. From a purely theoretical point of view, there is no reason to prefer
subsampling in this problem. Resampling and subsampling will both produce
uniformly consistent distribution estimators, but neither will produce a test that is
consistent against all alternatives. However, as a matter of practicality, it might be
useful to use each method as a complement to the other. In fact, our subsampling
based plots below show that there is probably some truth in that. In this section
we will present a brief description of subsampling based tests. A more complete
presentation of the ideas in this section will be presented elsewhere.

5.1. Consistency

We return to the univariate case and again focus on the independence of the
sample mean and sample variance; however, in this section, we will consider the
subsampling methodology (see e.g., Politis, Romano and Wolf (1999)). Denote by
$B_{b,1}, \ldots, B_{b,Q}$ the $Q = \binom{n}{b}$ subsamples of size $b$ that can be extracted from the
sample $X_1, \ldots, X_n$. The subsamples are ordered in an arbitrary fashion except
that, for convenience, the first $q = [n/b]$ subsamples will be taken to be the
non-overlapping stretches, i.e., $B_{b,1} = (X_1, \ldots, X_b)$, $B_{b,2} = (X_{b+1}, \ldots, X_{2b})$, $\ldots$,
$B_{b,q} = (X_{(q-1)b+1}, \ldots, X_{qb})$. In the above, $b$ is an integer in $(1, n)$ and $[\cdot]$ denotes
integer part.

Let $\bar{X}_{b,i}$ and $s^2_{b,i}$ denote the sample mean and sample variance as calculated from
subsample $B_{b,i}$ alone. Similarly, let $U_{b,i} = \Phi\left(\frac{\sqrt{b}(\bar{X}_{b,i} - \mu)}{\sigma}\right)$ and $V_{b,i} =
G_{b-1}\left(\frac{(b-1)s^2_{b,i}}{\sigma^2}\right)$. Thus, if $b$ were $n$, these would just be $U_n$ and $V_n$ as defined
in Subsection 3.1. Note that $U_{b,i}$ and $V_{b,i}$ are not proper statistics since $\mu$ and
$\sigma$ are unknown; our proxies for $U_{b,i}$ and $V_{b,i}$ will be $\hat{U}_{b,i} = \Phi\left(\frac{\sqrt{b}(\bar{X}_{b,i} - \bar{X})}{s}\right)$ and
$\hat{V}_{b,i} = G_{b-1}\left(\frac{(b-1)s^2_{b,i}}{s^2}\right)$ respectively.

Let $H_b(x, y) = P(U_{b,1} \le x, V_{b,1} \le y)$. Recall that, under normality, $H_b$ is uniform
on the unit square. However, using subsampling we can consistently estimate
$H_b$ (or its limit $H$ given in Theorem 4) whether normality holds or not. As in Politis
et al. (1999), we define the subsampling distribution estimator by

$$ \hat{L}_b(x, y) = \frac{1}{Q} \sum_{i=1}^{Q} 1\{\hat{U}_{b,i} \le x, \hat{V}_{b,i} \le y\}. \quad (2) $$

Then the following consistency result ensues.

Theorem 8. Assume the conditions of Theorem 4. Then

(i) For any fixed integer $b > 1$, we have $\hat{L}_b(x, y) \xrightarrow{P} H_b(x, y)$ as $n \to \infty$ for all
points $(x, y)$ of continuity of $H_b$.

(ii) If $\min(b, n/b) \to \infty$, then $\sup_{x, y} |\hat{L}_b(x, y) - H(x, y)| \xrightarrow{P} 0$.

Proof. (i) Let $(x, y)$ be a point of continuity of $H_b$, and define

$$ L_b(x, y) = \frac{1}{Q} \sum_{i=1}^{Q} 1\{U_{b,i} \le x, V_{b,i} \le y\}. $$

Note that by an argument similar to that in the proof of Theorem 2.2.1 in Politis,
Romano and Wolf (1999), we have that

$$ \hat{L}_b(x, y) - L_b(x, y) \to 0 $$

on a set whose probability tends to one. Thus it suffices to show that $L_b(x, y) \xrightarrow{P}
H_b(x, y)$. But note that $E L_b(x, y) = H_b(x, y)$; hence, it suffices to show that
$\mathrm{Var}(L_b(x, y)) = o(1)$.

Let

$$ \tilde{L}_b(x, y) = \frac{1}{q} \sum_{i=1}^{q} 1\{U_{b,i} \le x, V_{b,i} \le y\}. $$

By a Cauchy-Schwarz argument, it can be shown that $\mathrm{Var}(L_b(x, y)) \le
\mathrm{Var}(\tilde{L}_b(x, y))$; in other words, extra averaging will not increase the variance.
But $\mathrm{Var}(\tilde{L}_b(x, y)) = O(1/q) = O(b/n)$ since $\tilde{L}_b(x, y)$ is an average of $q$ i.i.d.
random variables. Hence $\mathrm{Var}(L_b(x, y)) = O(b/n) = o(1)$ and part (i) is proven.
Part (ii) follows by a similar argument; the uniform convergence follows from the
continuity of $H$ given in Theorem 4 and a version of Pólya's theorem for random
cdfs. □


5.2. Subsampling based scatterplots 

Theorem 8 suggests looking at a scatterplot of the pairs $\hat{W}_{b,i} = (\hat{U}_{b,i}, \hat{V}_{b,i})$ to detect
non-normality since (under normality) the points should look uniformly scattered
over the unit square, in a fashion analogous to the pairs $w^*_i$ in Sections 3 and 4.

Below, we present a few of these scatterplots and then discuss the plots. The
subsample size $b$ in the plots is taken to be 2.
For each distribution, two separate plots are presented to illustrate the quite
dramatic nonuniform structure for the nonnormal cases.

[Figure: Subsampling Based Test for Normality using N(0,1) Data; n = 25, b = 2]
[Figure: Subsampling Based Test for Normality using Exp(1) Data]
[Figure: Subsampling Based Test for Normality using U[0,1] Data; n = 25, b = 2]
[Figure: Subsampling Based Test for Normality using U[0,1] Data; n = 25, b = 2]

5.3. Discussion of the plots

Again, we are forced to present a limited number of plots due to space considerations.
The plots corresponding to the Exponential and the uniform case show
obvious nonuniform structure; they also show significant amounts of empty space.
In fact, compared to the corresponding scatterplots for uniform data for the bootstrap
based test in Section 3.3, the structured deviation from a uniform scattering
is more evident in these plots. Subsampling seems to be working rather well in
detecting nonnormality in the way we propose here! But there is also a problem. The
problem seems to be that even for normal data, the scatterplots exhibit structured
patterns, much in the same way as for uniform data, but to a lesser extent. Additional
theoretical justification for these very special patterns in the plots is needed.

We do not address other issues such as choice of the subsample size, due to space
considerations and because our focus in this article is on just the resampling part.


6. Scope of other applications 

The main merits of our proposal in this article are that it gives a user something
of credibility to use in small samples, and that it has scope for broad
applications. To apply our proposal in a given problem, one only has to look for an
effective characterization result for the null hypothesis. If there are many characterizations
available, presumably one can choose which one to use. We give a very
brief discussion of potential other problems where our proposal may be useful. We
plan to present these ideas for the problems stated below in detail in a future article.


1. Testing for sphericity 


Suppose $X_1, X_2, \ldots, X_n$ are iid p-vectors and we want to test the hypothesis $H_0$:
the common distribution of the $X_i$ is spherically symmetric. For simplicity of
explanation here, consider only the case p = 2. Literature on this problem includes
Baringhaus (1991), Koltchinskii and Li (1998) and Beran (1979).

Transforming each X to its polar coordinates $r, \theta$, under $H_0$, $r$ and $\theta$ are
independent. Thus, we can test $H_0$ by testing for independence of $r$ and $\theta$. The data
we will use is a sample of n pairs of values $(r_i, \theta_i)$, $i = 1, 2, \ldots, n$. Although the
testing can be done directly from these pairs without recourse to resampling or
subsampling, for small n, re or subsampling tests may be useful, as we witnessed
in the preceding sections in this article.

There are several choices on how we can proceed. A simple correlation based
test can be used. Specifically, denoting $D_i$ as the difference of the ranks of the $r_i$
and $\theta_i$ (respectively among all the $r_i$ and all the $\theta_i$), we can use the well known
Spearman coefficient:

$$r_S = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}.$$


For small n, we may instead bootstrap the $(r_i, \theta_i)$ pairs and form a scatterplot of
the bootstrapped pairs for each bootstrap replication. The availability of replicated
scatterplots gives one an advantage in assessing if any noticeable correlation between
$r$ and $\theta$ seems to be present. This would be an easy, although simple, visual method.
At a slightly more sophisticated level, we can bootstrap the $r_S$ statistic and compare
percentiles of the bootstrap distribution to the theoretical percentiles under $H_0$ of
the $r_S$ statistic. We are suggesting that we break ties just by halving the ranks.
For small n, the theoretical percentiles are available exactly; otherwise, we can use
the percentiles from the central limit theorem for $r_S$ as (hopefully not too bad)
approximations.
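
As an illustration only, the bootstrap of the Spearman statistic on the polar coordinates might be sketched as follows. The function names (`midranks`, `spearman_rs`, `bootstrap_rs`), the use of NumPy, and the default number of replications are assumptions made for this sketch and are not part of the proposal in the text; ties are handled with midranks, which is one reading of "halving the ranks".

```python
import numpy as np


def midranks(values):
    """Ranks 1..n; tied values share the average of the ranks they occupy."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman_rs(r, theta):
    """r_S = 1 - 6 * sum(D_i^2) / (n (n^2 - 1)), with D_i the rank difference."""
    d = midranks(r) - midranks(theta)
    n = len(d)
    return 1.0 - 6.0 * float(np.sum(d**2)) / (n * (n**2 - 1))


def bootstrap_rs(xy, n_boot=1000, seed=0):
    """Bootstrap distribution of r_S for the polar coordinates of 2-d data."""
    xy = np.asarray(xy, dtype=float)
    r = np.hypot(xy[:, 0], xy[:, 1])
    theta = np.arctan2(xy[:, 1], xy[:, 0])  # angle in radians, in [-pi, pi]
    rng = np.random.default_rng(seed)
    n = len(r)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample the (r_i, theta_i) pairs
        stats[b] = spearman_rs(r[idx], theta[idx])
    return stats
```

The percentiles of the returned array (for example via `np.percentile`) would then be compared with the exact or approximate null percentiles of $r_S$.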

We should mention that other choices exist. An obvious one is Hoeffding's D-statistic
for independence. Under $H_0$, $nD_n + \frac{1}{36}$ has a known (nonnormal) limit
distribution. Although an exact formula for its CDF appears to be unknown, from
the known formula for its characteristic function (see Hoeffding (1948)), we can
pin down any specified percentile of the limit distribution. In addition, for small
n, the exact distribution of $D_n$ under $H_0$ is available too. We can thus find either
the exact or approximate percentiles of the sampling distribution of $nD_n + \frac{1}{36}$ and
compare percentiles of the bootstrap distribution to them. If we prefer a plot based
test, we can construct a Q-Q plot of bootstrap percentiles against the theoretical
percentiles under $H_0$ and interpret the plot in the standard manner a Q-Q plot is
used.


2. Testing for Poissonity 

This is an important problem for practitioners and has quite a bit of literature, e.g.,
Brown and Zhao (2002), and Gürtler and Henze (2000). Both articles give references
to classic literature. If $X_1, X_2, \ldots, X_n$ are iid from a Poisson($\lambda$) distribution, then
obviously $\sum_{i=1}^{n} X_i$ is also Poisson-distributed, and therefore every cumulant of the
sampling distribution of $\sum_{i=1}^{n} X_i$ is $n\lambda$. We can consider testing that a set of
specified cumulants are equal by using re or subsampling methods. Or, we can consider
a fixed cumulant, say the third for example, and inspect if the cumulant estimated
from a bootstrap distribution behaves like a linear function of n passing through
the origin. For example, if the original sample size is n = 15, we can estimate a
given order cumulant of $\sum_{i=1}^{m} X_i$, $m = 1, 2, \ldots, 15$, and visually assess if
the estimated values fall roughly on a straight line passing through the origin as m
runs through 1 to 15. The graphical test can then be repeated for a cumulant of
another order and the slopes of the lines compared for approximate equality too.
Using cumulants of different orders would make the test more powerful, and we
recommend it.

The cumulants can be estimated from the bootstrap distribution either by
differentiating the empirical cumulant generating function $\log\left(\sum_s e^{ts} P_*(S^* = s)\right)$ or
by estimating instead the moments and then using the known relations between
cumulants and moments (see, e.g., Shiryaev (1980)).
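
A minimal sketch of the moment-based route is given below, under several assumptions that are not in the text: the bootstrap distribution of the sum of m resampled observations is approximated by Monte Carlo, only the first three cumulants are used (mean, variance, third central moment), and the straight line through the origin is fitted by least squares. All function names are hypothetical.

```python
import numpy as np


def first_three_cumulants(sample):
    """kappa_1, kappa_2, kappa_3 of the empirical distribution of `sample`:
    mean, variance (divisor len(sample)) and third central moment."""
    s = np.asarray(sample, dtype=float)
    centered = s - s.mean()
    return np.array([s.mean(), np.mean(centered**2), np.mean(centered**3)])


def bootstrap_sum_cumulants(x, n_boot=2000, seed=0):
    """Row m-1 holds the estimated cumulants of the bootstrap sum of m draws."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    out = np.empty((n, 3))
    for m in range(1, n + 1):
        sums = rng.choice(x, size=(n_boot, m), replace=True).sum(axis=1)
        out[m - 1] = first_three_cumulants(sums)
    return out


def slopes_through_origin(cumulants):
    """Least-squares slope of each cumulant against m for a line through 0."""
    m = np.arange(1, cumulants.shape[0] + 1, dtype=float)
    return (m[:, None] * cumulants).sum(axis=0) / (m**2).sum()
```

Plotting each column of `bootstrap_sum_cumulants(x)` against m gives the graphical check described above, and the entries of `slopes_through_origin` can be compared for approximate equality across orders.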

3. Testing for exponentiality

Testing for exponentiality has a huge literature and is of great interest in many areas
of application. We simply recommend Doksum and Yandell (1984) as a review of
the classic literature on the problem. A large number of characterization results for
the family of Exponential distributions are known in the literature. Essentially any
of them, or a combination, can be used to test for exponentiality. We do not have
reliable information at this time on which characterizations translate into better
tests. We mention here only one as illustration of how this can be done.

One possibility is to use the spacings based characterization that the $(n - i + 1)R_i$
are iid Exponential($\lambda$) where $\lambda$ is the mean of the population under $H_0$, and the $R_i$
are the successive spacings. There are a number of ways that our general method
can be used. Here are a few. A simple plot based test can select two values of i, for
example $i = [n/2]$ and $[n/2] + 1$, so that the ordinary bootstrap instead of an
m-out-of-n bootstrap can be used, and check the pairs for independence. For example, a
scatterplot of the bootstrapped pairs can be constructed. Or, one can standardize
the bootstrapped values by $\bar{X}$, so that we will then have pairs of approximately
iid Exponential(1) values. Then we can use the quantile transformation on them
and check these for uniformity in the unit square as in Section 3. Or, just as we
described in the section on testing for sphericity, we can use the Hoeffding D-statistic
in conjunction with the bootstrap with the selected pairs of $(n - i + 1)R_i$.
One can then use two other values of i to increase the diagnostic power of the test.
There are ways to use all of the $(n - i + 1)R_i$ simultaneously as well, but we do not
give the details here.
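
The normalized spacings and the bootstrapped middle pair could be computed along the following lines. This is a sketch under stated assumptions: the spacings are taken as $R_i = X_{(i)} - X_{(i-1)}$ with $X_{(0)} = 0$, the function names are hypothetical, and the resampling scheme is the ordinary bootstrap.

```python
import numpy as np


def normalized_spacings(x):
    """Return (n - i + 1) * (X_(i) - X_(i-1)) for i = 1..n, with X_(0) = 0."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    gaps = np.diff(np.concatenate(([0.0], xs)))
    weights = n - np.arange(1, n + 1) + 1  # n, n-1, ..., 1
    return weights * gaps


def bootstrap_middle_pair(x, n_boot=500, seed=0):
    """Bootstrap the normalized spacings at i = [n/2] and [n/2] + 1."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    i = n // 2  # 1-based index [n/2]
    pairs = np.empty((n_boot, 2))
    for b in range(n_boot):
        d = normalized_spacings(rng.choice(x, size=n, replace=True))
        pairs[b] = d[i - 1], d[i]
    return pairs
```

Dividing the returned pairs by the sample mean and applying `1 - np.exp(-pairs)` would give the quantile-transformed values to be checked for uniformity. Note that an ordinary bootstrap resample contains repeated values, so some bootstrapped spacings are exactly zero.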
Notes on the bias-variance trade-off phenomenon

Jeesen Chen

University of Cincinnati

Abstract: The main inequality (Theorem 1) here involves the Hellinger distance
of a statistical model of an observation X, which imposes bounds on the
mean of any estimator in terms of its variance. We use this inequality to explain
some of the bias-variance trade-off phenomena studied in Doss and Sethuraman
(1989) and Liu and Brown (1993). We provide some quantified results
about how the reduction of bias would increase the variance of an estimator.


1. Introduction 

In certain estimation problems the following “bias-variance trade-off” phenomenon
might occur: the price of reducing the bias of an estimator T is the dramatic increase
of its variance. For problems exhibiting this property, one shouldn’t apply the bias
reducing procedures blindly. Furthermore, any estimator having good mean square
error performance should be biased, and there is a balance between the bias function
and the variance function. It is desirable to study the scope of this phenomenon
and how the variance and the bias of an estimator affect each other.

Doss and Sethuraman (1989) seem to have been the first to demonstrate the
existence of the long suspected bias-variance trade-off phenomenon. However, this
result requires stringent conditions, such as the nonexistence of unbiased estimators
for the problem and the square integrability of relative densities for the statistical
model, thus severely restricting its applicability.

Liu and Brown (1993) broadened the scope of, and brought a new element, the
singular/regular property of an estimation problem, into the study of the trade-off
phenomenon. Here the focus is on a special aspect of the trade-off phenomenon, the
“nonexistence of informative (i.e. bounded variances) unbiased estimators” property,
and its connection with the singular/regular property is studied. For singular
estimation problems, the bias-variance trade-off phenomenon is an essential component
since the “nonexistence of informative unbiased estimators” property always
holds (see Theorem 1 of Liu and Brown (1993)). For regular estimation problems,
however, the connection is not clear. On one hand, due to the effect of a singular
point as a limiting point, the “nonexistence of informative unbiased estimators”
property does occur in some regular estimation problems, even though those problems
may be quadratic-mean-differentiable with Fisher information totally bounded
away from zero. (See Example 2 of Liu and Brown (1993).) On the other hand, there
are many known regular estimation problems having informative unbiased estimators.
Therefore, focusing on the singular/regular property alone can’t completely
describe the scope of the bias-variance trade-off phenomenon.
It is intriguing to consider how the results of Liu and Brown (1993) may be
perceived. The impression may be that Theorem 1 of Liu and Brown (1993), the
“nonexistence of informative unbiased estimators” for a singular estimation problem,
seems compatible with the well-known Rao-Cramér inequality. This inequality,
under suitable regularity conditions, provides a lower bound of variances for unbiased
estimators in terms of the reciprocal of the Fisher information number. For a
singular point (or, a point with zero Fisher information number), the lower bound
of variances for unbiased estimators becomes infinite; hence it is impossible to have
an informative unbiased estimator (if the regularity conditions of the Rao-Cramér
inequality hold). With this impression, one might be surprised to see Example 4 of
Liu and Brown (1993) which exhibits an unbiased estimator with finite variance at
a singular point. This seems to contradict the Rao-Cramér inequality or Theorem 1
of Liu and Brown (1993). Of course, there is no contradiction here: first, Example 4
of Liu and Brown (1993) violates the required regularity conditions for the
Rao-Cramér inequality; second, Theorem 1 of Liu and Brown (1993) only prevents
the possibility of an unbiased estimator having a uniform finite upper bound for
variances in any Hellinger neighborhood of a singular point, and not the possibility
of an unbiased estimator with finite variance at a singular point. Nevertheless, the
possible confusion indicates the need to find a framework in which we can put all the
perception here into a more coherent view. One suggestion is to use an “appropriate
variation” of the Rao-Cramér inequality to understand the bias-variance trade-off
phenomenon. This modification of the Rao-Cramér inequality would place restrictions
regarding the variances of unbiased estimators on the supremum of variances
in any Hellinger neighborhood of a point, instead of restricting the variance at the
point only. (We believe our results in this paper validate the above suggestion.)

Low (1995), in the context of the functional estimation of finite and infinite
normal populations, studies possible bias-variance trade-off by solving explicitly
constrained optimization problems: imposing a constraint on either the variance or
the square of the bias, then finding the procedure which minimizes the supremum
of the unconstrained performance measure. This approach, due to mathematical
difficulties involved, seems very difficult to carry out for general estimation problems.
However, the investigation of the “bias-variance trade-off” phenomenon in
the framework of the study of quantitative restrictions between bias and variance
is interesting.

In this paper, we observe that the “nonexistence of informative unbiased estimators”
phenomenon and the “bias-variance trade-off” phenomenon exemplify the
mutual restrictions between mean functions and variance functions of estimators.
These restrictions are described in our main inequality, Theorem 1. We are able to
use this inequality to study, for finite sample cases, the “bias-variance trade-off”
phenomenon and the “nonexistence of informative unbiased estimators” phenomenon
for singular as well as regular estimation problems. A simple application of
Theorem 1, Corollary 1, induces a sufficient condition for the “nonexistence of
informative unbiased estimators” phenomenon. Corollary 1 is applicable to singular
problems (e.g. it implies Theorem 1 of Liu and Brown (1993)), as well as to
regular problems (e.g. Example 2 of Section 4). Additional applications, such as
Theorem 2 and Theorem 3, shed further light on the trade-off phenomenon by giving
some quantified results. These results not only imply (and extend) Theorem 1
and Theorem 3 of Liu and Brown (1993), they also provide a general lower bound
for constrained minimax performance. (See Corollary 3 and related comments.) We
may summarize the idea conveyed by these results as: if the estimator we consider
has variance less than the smallest possible variances for any unbiased estimators,
then the range of the bias function is at least comparable to a fixed proportion of
the range of the parameter function to be estimated.

We address the influence of a singular point as a limiting (parameter) point in
Theorem 4. Although this is not a direct consequence of Theorem 1, the format of
Theorem 1 facilitates results like Theorem 4.

We state our results in Section 2 and prove them in Section 3. In Section 4 we
explain the meaning of Example 2 and Example 4 of Liu and Brown (1993) in our
approach to the “bias-variance trade-off” phenomenon. We also argue that examples
like Example 4 of Liu and Brown (1993) validate our version of the “mean-variance
restriction,” in which the restrictions imposed on the bias function of an estimator
by its variance function are on the difference of biases at two points instead of
the bias function at a point. Example 1 of Section 4, which has been considered
by Low (1995) (and maybe others also), shows that our lower bound for minimax
performance, Corollary 3, is sharp. The last example, Example 2, shows that the
“nonexistence of informative unbiased estimator” phenomenon may occur even if
the parameter space does not have any limiting point (with respect to Hellinger
distance).

2. Statements of results 

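We shall consider the following estimation problem. Let X be a random variable,
which takes values in a measure space $(\Omega, \mu)$, with distribution from a family of
probability measures $\mathcal{F} = \{P_\theta : \theta \in \Theta\}$. Furthermore, it is assumed that every $P_\theta$
in $\mathcal{F}$ is dominated by the measure $\mu$, and if $P_{\theta_1} = P_{\theta_2}$ then $\theta_1 = \theta_2$. For $\theta \in \Theta$,
we denote the Radon-Nikodym derivative of $P_\theta$ with respect to the $\sigma$-finite measure $\mu$
as $f_\theta = dP_\theta/d\mu$. For $\theta_1, \theta_2 \in \Theta$, let

$$\rho(\theta_1, \theta_2) := \left\{ \int_\Omega \left[ f_{\theta_1}^{1/2}(x) - f_{\theta_2}^{1/2}(x) \right]^2 \mu(dx) \right\}^{1/2} \tag{2.1}$$

denote the Hellinger distance between $\theta_1$ and $\theta_2$, on $\Theta$, induced by the statistical
model $\mathcal{F} = \{P_\theta : \theta \in \Theta\}$. Suppose $(V, \|\cdot\|)$ is a pseudo-normed linear space,
and $q : \Theta \to V$ is a function. We shall estimate $q(\theta)$ based on an observation X.

The estimators $T : \Omega \to V$ we consider are well-behaved functions (satisfying the
required measurability conditions) so that, for $\theta \in \Theta$,

$$\psi_T(\theta) := \int_\Omega f_\theta(x) T(x) \mu(dx) \tag{2.2}$$

is meaningful and belongs to $V$, and $v_T(\theta) := \int_\Omega f_\theta(x) \|T(x)\|^2 \mu(dx)$ is meaningful.

We also adopt the following notations:

$$\beta_T(\theta) := E_\theta(T(X) - q(\theta)), \tag{2.3}$$

the bias function of T;

$$\gamma_T(\theta) := \left\{ E_\theta \|T(X) - q(\theta)\|^2 \right\}^{1/2}, \tag{2.4}$$

the mean square risk function of T; and, for $\Theta_0 \subset \Theta$,

$$M_T(\Theta_0) := \sup\left\{ E_\theta \|T(X) - q(\theta)\|^2 : \theta \in \Theta_0 \right\}. \tag{2.5}$$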

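The starting point of our study is the following inequality:

Theorem 1. For $\theta, \theta_0 \in \Theta$, if $\rho(\theta, \theta_0) > 0$, then

$$[\gamma_T(\theta) + \gamma_T(\theta_0)]\, \rho(\theta, \theta_0)
\ge \left\| (\beta_T(\theta) - \beta_T(\theta_0)) + \left[1 - \tfrac{1}{2}\rho^2(\theta, \theta_0)\right] (q(\theta) - q(\theta_0)) \right\|
\ge \left| \|\beta_T(\theta) - \beta_T(\theta_0)\| - \left[1 - \tfrac{1}{2}\rho^2(\theta, \theta_0)\right] \|q(\theta) - q(\theta_0)\| \right|. \tag{2.6}$$

An easy consequence of (2.6) is:

Corollary 1. Suppose $\Theta_1$ is a non-empty subset of $\Theta - \{\theta_0\}$, then

$$2 \sup\{\gamma_T(\theta) : \theta \in \Theta_1 \cup \{\theta_0\}\}
+ \sup\left\{ \frac{\|\beta_T(\theta) - \beta_T(\theta_0)\|}{\rho(\theta, \theta_0)} : \theta \in \Theta_1 \right\}
\ge \sup\left\{ \left[1 - \tfrac{1}{2}\rho^2(\theta, \theta_0)\right] \frac{\|q(\theta) - q(\theta_0)\|}{\rho(\theta, \theta_0)} : \theta \in \Theta_1 \right\}. \tag{2.7}$$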

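Let us denote the value of the right-hand side of (2.7) as $Q_q(\theta_0; \Theta_1)$. We point
out that the quantity $Q_q(\theta_0; \Theta_1)$ does not depend on the estimator T. It is easy to
see that $Q_q(\theta_0; \Theta_1) = \infty$ is a sufficient condition for the “nonexistence of informative
unbiased estimators” phenomenon. There are two ways to make $Q_q(\theta_0; \Theta_1) = \infty$:
either $\inf_{\theta \in \Theta_1} \rho(\theta, \theta_0) > 0$ with $\sup_{\theta \in \Theta_1} [1 - \frac{1}{2}\rho^2(\theta, \theta_0)] \|q(\theta) - q(\theta_0)\| = \infty$, or
$\inf_{\theta \in \Theta_1} \rho(\theta, \theta_0) = 0$ with $\limsup_{\rho(\theta, \theta_0) \to 0,\, \theta \in \Theta_1} \|q(\theta) - q(\theta_0)\| / \rho(\theta, \theta_0) = \infty$. See Example 2 of
Section 4 for the first case and Examples 1 and 3 of Liu and Brown (1993) for the
second case.

In the following, we focus on the case that $\theta_0$ is a limit point of $\Theta_1$ with respect
to $\rho$-distance. Note that we may replace $Q_q(\theta_0; \Theta_1)$ in the right-hand side of (2.7)
by an easily computable lower bound $\limsup_{\rho(\theta, \theta_0) \to 0,\, \theta \in \Theta_1} \|q(\theta) - q(\theta_0)\| / \rho(\theta, \theta_0)$.

For the convenience of our discussion let us introduce: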


Definition 1 (Hellinger Information). Suppose $\Theta_1 \subset \Theta$ and $\theta_0$ is a non-isolated
point of $\Theta_1$ with respect to the $\rho$-metric on $\Theta$. The Hellinger Information of $\theta_0$ about
the $q(\cdot)$-estimation problem and the (sub-)parameter space $\Theta_1$ is defined as

$$J_q(\theta_0; \Theta_1) := 4 \left[ \limsup_{\rho(\theta, \theta_0) \to 0+,\ \theta \in \Theta_1} \frac{\|q(\theta) - q(\theta_0)\|}{\rho(\theta, \theta_0)} \right]^{-2}. \tag{2.8}$$

For the development of this notion and its relationship to Fisher Information,
see Chen (1995). We mention here that this notion is related to “sensitivity”
proposed by Pitman (1978). Also, it is equivalent to the “Geometric Information”
in Donoho and Liu (1987), and, in terms of Hellinger modulus (see Liu and Brown
(1993) (2.9) and (2.2)), it is $(\lim_{\varepsilon \to 0+}$ [...]$)$. When $J_q(\theta_0; \Theta) = 0$ (resp. $> 0$), we
say that the $q(\cdot)$-estimation problem is singular (resp. regular) at point $\theta_0$.
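
As a numerical illustration of Definition 1, the sketch below evaluates the ratio in (2.8) for the normal location family with $q(\theta) = \theta$. It relies on the closed form $\rho^2(\theta_1, \theta_2) = 2\{1 - \exp[-(\theta_1 - \theta_2)^2 / (8\sigma^2)]\}$ for two normal densities with common variance; the function names and the step size `h` are assumptions made for this sketch.

```python
import numpy as np


def hellinger_normal(theta1, theta2, sigma):
    """rho(theta1, theta2) for N(theta, sigma^2), using
    rho^2 = 2 * (1 - exp(-(theta1 - theta2)^2 / (8 sigma^2)))."""
    d2 = (theta1 - theta2) ** 2 / (8.0 * sigma**2)
    return np.sqrt(2.0 * (-np.expm1(-d2)))


def hellinger_information(theta0, sigma, h=1e-3):
    """Approximate J_q(theta0) = 4 * [|q(theta) - q(theta0)| / rho]^(-2)
    for q(theta) = theta, evaluated at the small offset theta = theta0 + h."""
    ratio = h / hellinger_normal(theta0 + h, theta0, sigma)
    return 4.0 / ratio**2
```

For this family the value returned is close to $1/\sigma^2$, the Fisher information, so the problem is regular at every point.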

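With the notation of Hellinger Information, an easy corollary of Theorem 1 is:

Corollary 2. Suppose $\theta_0$ is an accumulation point of $\Theta_0 \subset \Theta$. Then, for
$J = J_q(\theta_0; \Theta_0)$,

$$2 [M_T(\Theta_0)]^{1/2} + \sup\left\{ \frac{\|\beta_T(\theta) - \beta_T(\theta_0)\|}{\rho(\theta, \theta_0)} : \theta \in \Theta_0,\ \theta \ne \theta_0 \right\} \ge 2 J^{-1/2} \tag{2.9}$$

or, equivalently,

$$\sup\left\{ \frac{\|\beta_T(\theta) - \beta_T(\theta_0)\|}{\rho(\theta, \theta_0)} : \theta \in \Theta_0,\ \theta \ne \theta_0 \right\} \ge \frac{2}{\sqrt{J}} \left[ 1 - (M_T(\Theta_0) J)^{1/2} \right]. \tag{2.10}$$

A trivial implication of (2.10) is: if $M_T(\Theta_0) < 1/J_q(\theta_0; \Theta_0)$, then T is not
unbiased on $\Theta_0$. Moreover, (2.10) puts a restriction on the bias function $\beta_T(\theta)$
of T. We shall state this restriction more explicitly in the next theorem.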

Theorem 2. Suppose $\theta_0$ is an accumulation point of $\Theta_0 \subset \Theta$, and $M$ is a positive
number such that $M < [J_q(\theta_0; \Theta_0)]^{-1}$. Let $d_M := 1 - (M J_q(\theta_0; \Theta_0))^{1/2}$. Suppose that
T is an estimator with $E_\theta \|T(X) - q(\theta)\|^2 \le M$ for all $\theta \in \Theta_0$. Then, for any $\lambda > 0$,
there exists $\theta_\lambda \in \Theta_0$, not dependent on T, such that $0 < \rho(\theta_\lambda, \theta_0) < (2\lambda d_M)^{1/2}$,
$\|q(\theta_\lambda) - q(\theta_0)\| > 0$, and

$$\|\beta_T(\theta_\lambda) - \beta_T(\theta_0)\| \ge (1 - \lambda)\, d_M \cdot \|q(\theta_\lambda) - q(\theta_0)\|. \tag{2.11}$$

Applying Theorem 2, it is easy to obtain a lower bound for constrained minimax
performance.

Corollary 3. Let $\theta_0$ be an accumulation point of $\Theta$, $J = J_q(\theta_0; \Theta) > 0$. Let $M$
and $\tau$ be positive numbers, and

$$B(M; \tau) := \inf_T \sup_\theta \left\{ \|\beta_T(\theta) - \beta_T(\theta_0)\|^2 \right\},$$

where $\theta$ is over $\|q(\theta) - q(\theta_0)\| \le \tau$ and T is over $E_\theta \|T(X) - q(\theta)\|^2 \le M$. Then

$$B(M; \tau) \ge \left\{ \left[ 1 - (MJ)^{1/2} \right] \vee 0 \right\}^2 \tau^2. \tag{2.12}$$

In the restricted normal mean case (see Example 1), the lower bound (2.12) is
sharp.

Now, let us turn to the case in which $\theta_0$ is a singular point, i.e., $J_q(\theta_0; \Theta_0) = 0$.
From (2.9) or (2.10), we have either $M_T(\Theta_0) = \infty$ or
$\sup\{ \|\beta_T(\theta) - \beta_T(\theta_0)\| / \rho(\theta, \theta_0) : \theta \in \Theta_0,\ \theta \ne \theta_0 \} = \infty$. This implies the
non-existence of an informative unbiased estimator for such $\Theta_0$. Therefore, Theorem 1
of Liu and Brown (1993) is a weaker version of Corollary 2.

From Theorem 2 (or Corollary 3), it is easy to see that there exists no sequence
of asymptotically unbiased estimators (based on the same finite number of
observations) that would have uniformly bounded variance in any small Hellinger
neighborhood of a singular point $\theta_0$. Hence, Theorem 2 above implies Theorem 3 of
Liu and Brown (1993). For singular estimation problems, those estimators achieving
good mean square error performance must balance bias and variance, and (2.11)
gives a quantitative result about the bias function $\beta_T(\theta)$. Furthermore, we are able
to describe the “rate” of $\|\beta_T(\theta) - \beta_T(\theta_0)\|$ as follows.

Theorem 3. Suppose $J_q(\theta_0; \Theta_0) = 0$. Let $\Theta_1 = \{\theta_1, \theta_2, \ldots\} \subset \Theta_0 - \{\theta_0\}$ be a slow
sequence of $\theta_0$ in the sense that $\lim_{j \to \infty} \rho(\theta_j, \theta_0) = 0$ and
$\lim_{j \to \infty} \|q(\theta_j) - q(\theta_0)\| / \rho(\theta_j, \theta_0) = \infty$.
If T is an estimator with $\sup\{ E_{\theta_j} \|T(X) - q(\theta_j)\|^2 : j = 0, 1, 2, \ldots \} < \infty$, then

$$\lim_{j \to \infty} \frac{\|\beta_T(\theta_j) - \beta_T(\theta_0)\|}{\|q(\theta_j) - q(\theta_0)\|} = 1. \tag{2.13}$$
One of the important observations of Liu and Brown (1993) is that the
bias-variance trade-off phenomenon might occur on a set $\Theta_1$ due to the effect of a
singular point $\theta_0$ as a limit point of $\Theta_1$. The next result states it more explicitly.

Theorem 4. Suppose $V$ is a subspace of d-dimensional Euclidean space $R^d$ with
the usual Euclidean norm $\|\cdot\|$. Let $\theta_0$ be a singular point, $\Theta_1 = \{\theta_1, \theta_2, \ldots\} \subset
\Theta - \{\theta_0\}$ be a slow sequence of $\theta_0$ and T be an unbiased estimator on $\Theta_1$. Then
$\sup\{ E_\theta \|T(X) - q(\theta)\|^2 : \theta \in \Theta_1 \} = \infty$.


3. Proofs 

Theorem 1 is a simple application of the following inequality. 

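Lemma 1. For points $\eta_1, \eta_2 \in V$; $\theta_1, \theta_2 \in \Theta$ with $\rho(\theta_1, \theta_2) > 0$, we have

$$\left\{ E_{\theta_1} \|T(X) - \eta_1\|^2 \right\}^{1/2} + \left\{ E_{\theta_2} \|T(X) - \eta_2\|^2 \right\}^{1/2}
\ge \left\| \psi_T(\theta_1) - \psi_T(\theta_2) - \tfrac{1}{2}\rho^2(\theta_1, \theta_2)(\eta_1 - \eta_2) \right\| \Big/ \rho(\theta_1, \theta_2). \tag{3.1}$$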


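Proof. Without loss of generality, we assume that $E_{\theta_i} \|T(X) - \eta_i\|^2 < \infty$ for $i = 1, 2$.
Define $\alpha_i(x) = f_{\theta_i}(x)^{1/2} (T(x) - \eta_i)$ for $i = 1, 2$, and $\beta(x) = f_{\theta_1}(x)^{1/2} - f_{\theta_2}(x)^{1/2}$.
Then

$$\begin{aligned}
\int_\Omega \beta(x) [\alpha_1(x) + \alpha_2(x)]\, \mu(dx)
&= \int_\Omega \left[ f_{\theta_1}(x)(T(x) - \eta_1) - f_{\theta_2}(x)(T(x) - \eta_2) \right] \mu(dx)
 + \int_\Omega [f_{\theta_1}(x) f_{\theta_2}(x)]^{1/2} \mu(dx)\, (\eta_1 - \eta_2) \\
&= E_{\theta_1}(T(X) - \eta_1) - E_{\theta_2}(T(X) - \eta_2)
 + \int_\Omega [f_{\theta_1}(x) f_{\theta_2}(x)]^{1/2} \mu(dx)\, (\eta_1 - \eta_2) \\
&= (\psi_T(\theta_1) - \eta_1) - (\psi_T(\theta_2) - \eta_2) + \left[ 1 - \tfrac{1}{2}\rho^2(\theta_1, \theta_2) \right] (\eta_1 - \eta_2) \\
&= (\psi_T(\theta_1) - \psi_T(\theta_2)) - \tfrac{1}{2}\rho^2(\theta_1, \theta_2)(\eta_1 - \eta_2).
\end{aligned} \tag{3.2}$$

On the other hand, by the triangle inequality and the Cauchy-Schwarz inequality,

$$\begin{aligned}
\left\| \int_\Omega \beta(x) [\alpha_1(x) + \alpha_2(x)]\, \mu(dx) \right\|
&\le \sum_{i=1}^{2} \left\| \int_\Omega \beta(x) \alpha_i(x)\, \mu(dx) \right\| \\
&\le \sum_{i=1}^{2} \left[ \int_\Omega \beta^2(x)\, \mu(dx) \right]^{1/2} \left[ \int_\Omega \|\alpha_i(x)\|^2 \mu(dx) \right]^{1/2} \\
&= \rho(\theta_1, \theta_2) \sum_{i=1}^{2} \left[ E_{\theta_i} \|T(X) - \eta_i\|^2 \right]^{1/2}.
\end{aligned} \tag{3.3}$$

Combining (3.2) and (3.3), we obtain (3.1). □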


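Proof of Theorem 1. Applying Lemma 1, we have

$$\begin{aligned}
[\gamma_T(\theta) + \gamma_T(\theta_0)]\, \rho(\theta, \theta_0)
&\ge \left\| \psi_T(\theta) - \psi_T(\theta_0) - \tfrac{1}{2}\rho^2(\theta, \theta_0) [q(\theta) - q(\theta_0)] \right\| \\
&= \left\| \beta_T(\theta) - \beta_T(\theta_0) + \left( 1 - \tfrac{1}{2}\rho^2(\theta, \theta_0) \right) (q(\theta) - q(\theta_0)) \right\|;
\end{aligned} \tag{3.4}$$

this proves the first inequality of (2.6).

Applying the triangle inequality and the fact that $1 - \frac{1}{2}\rho^2(\theta, \theta_0) \ge 0$, we obtain
the second inequality of (2.6). □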

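Proof of Corollary 1. Notice that (2.6) implies

$$2 \max(\gamma_T(\theta), \gamma_T(\theta_0)) + \frac{\|\beta_T(\theta) - \beta_T(\theta_0)\|}{\rho(\theta, \theta_0)}
\ge \left[ 1 - \tfrac{1}{2}\rho^2(\theta, \theta_0) \right] \frac{\|q(\theta) - q(\theta_0)\|}{\rho(\theta, \theta_0)}. \tag{3.5}$$

Letting $\theta$ vary over $\Theta_1$ in inequality (3.5), we obtain (2.7). □

Proof of Corollary 2. It is easy to prove that

$$Q_q(\theta_0; \Theta_0) \ge \limsup_{\rho(\theta, \theta_0) \to 0,\ \theta \in \Theta_0} \frac{\|q(\theta) - q(\theta_0)\|}{\rho(\theta, \theta_0)}
= \left[ \tfrac{1}{4} J_q(\theta_0; \Theta_0) \right]^{-1/2}. \tag{3.6}$$

This, together with Corollary 1, proves (2.9). □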


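Proof of Theorem 2. We use $J$ to replace $J_q(\theta_0; \Theta_0)$ in this proof.

Applying Theorem 1 and the condition $\gamma_T(\theta) + \gamma_T(\theta_0) \le 2M^{1/2}$, we have, for
all $\theta \in \Theta_0$, that

$$\frac{\|\beta_T(\theta) - \beta_T(\theta_0)\|}{\rho(\theta, \theta_0)} \ge \left[ 1 - \tfrac{1}{2}\rho^2(\theta, \theta_0) \right] \frac{\|q(\theta) - q(\theta_0)\|}{\rho(\theta, \theta_0)} - 2M^{1/2} \tag{3.7}$$

and

$$\frac{\|\beta_T(\theta) - \beta_T(\theta_0)\|}{\|q(\theta) - q(\theta_0)\|} \ge 1 - \tfrac{1}{2}\rho^2(\theta, \theta_0) - 2M^{1/2} \left[ \frac{\|q(\theta) - q(\theta_0)\|}{\rho(\theta, \theta_0)} \right]^{-1}. \tag{3.8}$$

For $\varepsilon > 0$, let $\Theta_0(\varepsilon) := \{\theta : q(\theta) \ne q(\theta_0),\ 0 < \rho(\theta, \theta_0) < \varepsilon\} \cap \Theta_0$. By (3.8), for
$\varepsilon = (2\lambda d_M)^{1/2}$, we have

$$\sup_{\theta \in \Theta_0(\varepsilon)} \frac{\|\beta_T(\theta) - \beta_T(\theta_0)\|}{\|q(\theta) - q(\theta_0)\|} \ge 1 - \lambda d_M - (MJ)^{1/2} = (1 - \lambda)\, d_M. \tag{3.9}$$

This proves Theorem 2. □

Proof of Corollary 3. Let $d_M := 1 - (MJ)^{1/2}$. For the case $d_M \le 0$, we use the
trivial inequality $B(M; \tau) \ge 0$ and for the case $d_M > 0$, we use Theorem 2 to obtain
$B(M; \tau) \ge (d_M \tau)^2$. This proves (2.12). □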

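Proof of Theorem 3. Let $M$ be a positive number such that $E_{\theta_j} \|T(X) - q(\theta_j)\|^2 \le M^2$
for $j = 0, 1, 2, \ldots$. Then, by (2.6),

$$2M \ge \frac{\|q(\theta_j) - q(\theta_0)\|}{\rho(\theta_j, \theta_0)} \left| \frac{\|\beta_T(\theta_j) - \beta_T(\theta_0)\|}{\|q(\theta_j) - q(\theta_0)\|} - \left( 1 - \tfrac{1}{2}\rho^2(\theta_j, \theta_0) \right) \right|. \tag{3.10}$$

Hence,

$$\begin{aligned}
1 - \rho^2(\theta_j, \theta_0)/2 - 2M \left[ \|q(\theta_j) - q(\theta_0)\| / \rho(\theta_j, \theta_0) \right]^{-1}
&\le \|\beta_T(\theta_j) - \beta_T(\theta_0)\| / \|q(\theta_j) - q(\theta_0)\| \\
&\le 1 - \rho^2(\theta_j, \theta_0)/2 + 2M \left[ \|q(\theta_j) - q(\theta_0)\| / \rho(\theta_j, \theta_0) \right]^{-1}.
\end{aligned} \tag{3.11}$$

Letting $j \to \infty$, we have the desired (2.13). □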


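In order to prove Theorem 4, we need the following lemma.

Lemma 2. Suppose $V$ is a subspace of d-dimensional Euclidean space $R^d$ with the
usual Euclidean norm $\|\cdot\|$. Let $\Theta_1 = \{\theta_1, \ldots\} \subset \Theta - \{\theta_0\}$ be a sequence with limit
point $\theta_0$ and $\lim_{j \to \infty} q(\theta_j) = q(\theta_0)$. Then, for any estimator T,

$$E_{\theta_0} \|T(X) - q(\theta_0)\|^2 \le M_T(\Theta_1). \tag{3.12}$$

Proof. (3.12) is automatically true if $M_T(\Theta_1) = \infty$. Let us consider the case that
$M_T(\Theta_1) < \infty$. Since $\rho(\theta_j, \theta_0) \to 0$ as $j \to \infty$, the distribution of T under
$\theta = \theta_j$ converges to the distribution of T under $\theta = \theta_0$. Let us write

$T = (T_1, T_2, \ldots, T_d)$,
$q(\theta) = (q_1(\theta), q_2(\theta), \ldots, q_d(\theta))$,
$\psi_T(\theta) = E_\theta T(X) = (\psi_1(\theta), \psi_2(\theta), \ldots, \psi_d(\theta))$, and
$v_T(\theta) = (\mathrm{var}_\theta(T_1), \mathrm{var}_\theta(T_2), \ldots, \mathrm{var}_\theta(T_d))$.

Notice that

$$\begin{aligned}
E_\theta \|T(X) - q(\theta)\|^2
&= E_\theta \|T(X) - \psi_T(\theta)\|^2 + \|\psi_T(\theta) - q(\theta)\|^2 \\
&= \sum_{i=1}^{d} \mathrm{var}_\theta(T_i) + \sum_{i=1}^{d} (\psi_i(\theta) - q_i(\theta))^2,
\end{aligned} \tag{3.13}$$

and, since $M_T(\Theta_1) < \infty$,

$$\lim_{j \to \infty} \psi_i(\theta_j) = \psi_i(\theta_0) \quad \text{for } i = 1, 2, \ldots, d. \tag{3.14}$$

By Problem 4.4.9, page 150 of Bickel and Doksum (1977), we have

$$\liminf_{j} \mathrm{var}_{\theta_j}(T_i) \ge \mathrm{var}_{\theta_0}(T_i) \quad \text{for } i = 1, 2, \ldots, d. \tag{3.15}$$

With the assumption $\lim_{j \to \infty} q(\theta_j) = q(\theta_0)$, and (3.13) ~ (3.15), we have

$$E_{\theta_0} \|T(X) - q(\theta_0)\|^2 \le \liminf_{j} E_{\theta_j} \|T(X) - q(\theta_j)\|^2 \le M_T(\Theta_1).$$

This proves (3.12). □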


Proof of Theorem 4. First, if $\lim_{j \to \infty} q(\theta_j) = \infty$, then it is easy to prove that
$\sup\{ E_\theta \|T(X) - q(\theta)\|^2 : \theta \in \Theta_1 \} = \infty$. Next, if $\lim_{j \to \infty} q(\theta_j)$ exists, we simply
change the definition of $q(\theta_0)$ to be equal to $\lim_{j \to \infty} q(\theta_j)$. Under this new definition
of $q$, $\theta_0$ is still a singular point and $\Theta_1$ is still a slow sequence of $\theta_0$. If $M_T(\Theta_1) < \infty$,
then (3.12) implies $M_T(\Theta_1 \cup \{\theta_0\}) = M_T(\Theta_1) < \infty$ and (2.9) implies
$2[M_T(\Theta_1 \cup \{\theta_0\})]^{1/2} = \infty$, a contradiction. This proves $M_T(\Theta_1) = \infty$. □

4. Comments and examples 

Example 2 of Liu and Brown (1993) shows that the “nonexistence of informative 
unbiased estimators” phenomenon might occur in a quadratic-mean-differentiable 
(QMD) problem with Fisher Information totally bounded away from zero. This 
statement is true if we replace the term “Fisher Information” by “Hellinger Infor- 
mation” since it is well-known that Fisher Information and Hellinger Information 
are equal in QMD problems. Due to the fact that the Hellinger Information number
$J(\theta)$ is not necessarily continuous with respect to the Hellinger distance $\rho(\theta, \theta_0)$,
the condition that Fisher Information (or Hellinger Information) be totally bounded 
away from zero does not exclude the possibility of a singular point as a limiting 
point. If such a singular limiting point exists, by Theorem 4, the “nonexistence of 
informative unbiased estimators” phenomenon could occur. 

Example 4 of Liu and Brown (1993) exhibits an unbiased estimator with finite
variance at a singular point. The spirit of this example does not contradict the im- 
pression left by the “mean-variance restriction” described in Theorem 1 or Corollary 
1. Obviously, one can modify an estimator so as to obtain an unbiased estimator 
at any predescribed point. The requirement that an estimator have finite variance 
at a predescribed point does not pose any conflict because the “mean-variance re- 
striction” (Theorem 1) places a lower bound on the sum of variances at two points, 
instead of on variances at each point. Further, one could even view this example 
as a validation of the form of “mean-variance restriction” (Theorem 1), in which
the restriction imposed by sums of variances (or, rather, sums of root mean-square
risks) is on the difference of the bias function $(\beta_T(\theta) - \beta_T(\theta_0))$ and not on the bias
function $(\beta_T(\theta_0))$ itself.

The following example shows that in the bounded normal case, the lower bound 
of Corollary 3 is sharp. This example has been considered by Low (1995). 

Example 1. If $X \sim N(\theta, \sigma^2)$ and $q(\theta) = \theta$, then $J = J_q(\theta_0; \Theta) = \sigma^{-2}$ for any open
interval $\Theta$ which contains $\theta_0$. By (2.9),

$$B(M; \tau) \ge \left\{ \left[ 1 - M^{1/2} \cdot \sigma^{-1} \right] \vee 0 \right\}^2 \tau^2. \tag{4.1}$$

Let $T_M$ be the affine procedure studied in Low (1995), (2.4),

$$T_M(X) = (M^{1/2} \cdot \sigma^{-1} \wedge 1)(X - \theta_0) + \theta_0. \tag{4.2}$$


It is easy to show that £ 9 ||Ta/(X) — 0||^ < M A and that 


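$$\sup\left\{ \|\beta_{T_M}(\theta) - \beta_{T_M}(\theta_0)\|^2 : |\theta - \theta_0| \le \tau \right\} = \left\{ \left[ 1 - M^{1/2} \cdot \sigma^{-1} \right] \vee 0 \right\}^2 \tau^2.$$

This, together with (2.12) proves

$$B(M; \tau) = \left\{ \left[ 1 - M^{1/2} \cdot \sigma^{-1} \right] \vee 0 \right\}^2 \tau^2. \tag{4.3}$$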
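
The closed-form quantities in Example 1 can be checked numerically. The sketch below only evaluates the variance of the affine procedure (4.2) and its worst-case squared bias difference over $|\theta - \theta_0| \le \tau$, and compares the latter with the right-hand side of the lower bound; the function names are hypothetical and the formulas follow from $T_M(X) = c(X - \theta_0) + \theta_0$ with $c = M^{1/2}\sigma^{-1} \wedge 1$.

```python
import math


def lower_bound(M, sigma, tau):
    """Right-hand side of the bound: ([1 - sqrt(M)/sigma] v 0)^2 * tau^2."""
    return max(1.0 - math.sqrt(M) / sigma, 0.0) ** 2 * tau**2


def affine_procedure_summary(M, sigma, tau):
    """Variance and worst-case squared bias difference of
    T_M(X) = c (X - theta0) + theta0 with c = min(sqrt(M)/sigma, 1)."""
    c = min(math.sqrt(M) / sigma, 1.0)
    variance = c**2 * sigma**2  # Var(T_M) does not depend on theta
    # bias(theta) - bias(theta0) = (c - 1)(theta - theta0); sup over |theta - theta0| <= tau
    worst_sq_bias = (1.0 - c) ** 2 * tau**2
    return variance, worst_sq_bias
```

For instance, with M = 1, sigma = 2 and tau = 3 the worst-case squared bias difference of the affine procedure coincides with `lower_bound(1.0, 2.0, 3.0)`, which is the sense in which the bound is attained.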


If we compare $B(M; \tau)$ with $\beta^2(\nu, \sigma, \tau)$ in (2.1) and (2.3) of Low (1995), we
find that $B(M; \tau) = \beta^2(M, \sigma, \tau)$ in the above Example 1. It is interesting to point
out that Low’s argument to obtain a lower bound on $\beta^2(\nu, \sigma, \tau)$ is an application
of the Rao-Cramér inequality. This approach, if extended to a general case, would
require conditions to guarantee the differentiability of the bias function of T. Our
method, which is based on Theorem 1, does not require the differentiability of the
bias function of T.

Finally, let us exhibit an example of the “nonexistence of informative unbiased
estimator” phenomenon for discrete $\Theta$ without any limiting point with respect to
$\rho$-distance.

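Example 2. Let $X \sim$ Poisson($\theta$) with $\theta \in N = \{1, 2, 3, \ldots\}$, and $r > 1$. Suppose
we want to estimate $q(\theta) = e^{r\theta}$. The square of Hellinger distance is

$$\rho^2(\theta_1, \theta_2) = 2 \left\{ 1 - \exp\left[ -\tfrac{1}{2} \left( \sqrt{\theta_1} - \sqrt{\theta_2} \right)^2 \right] \right\}.$$

It is easy to verify that $[1 - \frac{1}{2}\rho^2(\theta, 1)] \|q(\theta) - q(1)\| / \rho(\theta, 1) \to \infty$ for $\theta \to \infty$.
According to Corollary 1, there exists no informative unbiased estimator for $q(\theta)$.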
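
The divergence claimed in Example 2 can be inspected numerically. The sketch below assumes the reading $q(\theta) = e^{r\theta}$ of the garbled exponent and uses the standard closed form $\rho^2(a, b) = 2\{1 - \exp[-(\sqrt{a} - \sqrt{b})^2 / 2]\}$ for two Poisson distributions; the function names are hypothetical.

```python
import math


def hellinger_sq_poisson(a, b):
    """Squared Hellinger distance between Poisson(a) and Poisson(b):
    2 * (1 - exp(-(sqrt(a) - sqrt(b))^2 / 2))."""
    return 2.0 * (1.0 - math.exp(-0.5 * (math.sqrt(a) - math.sqrt(b)) ** 2))


def corollary1_ratio(theta, r):
    """[1 - rho^2(theta, 1) / 2] * |q(theta) - q(1)| / rho(theta, 1)
    with q(theta) = exp(r * theta) (an assumed form); requires theta != 1."""
    rho2 = hellinger_sq_poisson(theta, 1.0)
    dq = abs(math.exp(r * theta) - math.exp(r))
    return (1.0 - 0.5 * rho2) * dq / math.sqrt(rho2)
```

Evaluating `corollary1_ratio(theta, 1.5)` for theta = 2, 5, 10, 20 gives a rapidly increasing sequence, in line with the statement that the quantity tends to infinity.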
Combining correlated unbiased estimators
of the mean of a normal distribution

Timothy Keller¹ and Ingram Olkin²

National Agricultural Statistics Service and Stanford University

Abstract: There are many applications in which one seeks to combine multiple
estimators of the same parameter. If the constituent estimators are unbiased,
then the fixed linear combination which is minimum variance unbiased
is well-known, and may be written in terms of the covariance matrix of the
constituent estimators. In general, the covariance matrix is unknown, and one
computes a composite estimate of the unknown parameter with the covariance
matrix replaced by its maximum likelihood estimator. The efficiency of
this composite estimator relative to the constituent estimators has been inves- 
tigated in the special case for which the constituent estimators are uncorre- 
lated. For the general case in which the estimators are normally distributed 
and correlated, we give an explicit expression relating the variance of the com- 
posite estimator computed using the covariance matrix, and the variance of 
the composite estimator computed using the maximum likelihood estimate of 
the covariance matrix. This result suggests that the latter composite estima- 
tor may be useful in applications in which only a moderate sample size is 
available. Details of one such application are presented: combining estimates 
of agricultural yield obtained from multiple surveys into a single yield predic- 
tion. 


1. Introduction 

The need to combine estimators from different sources arises in many fields of
application. In agriculture estimates may come from different experimental stations;
in the medical sciences there may be multi-sites or multiple studies; sample surveys
may contain subsurveys at different locations; several laboratories might assay a
sample of one. Often making a prediction requires the combination of estimators.
The present analysis was motivated by a model to predict agricultural yield. However,
the model is generic, and occurs in a variety of contexts. The specifics of the
application are discussed in Section 5.

It is perhaps surprising that the earliest methods for combining estimators were
nonparametric. Fisher (1932) and Tippett (1931) proposed methods for combining
p-values obtained from independent studies. Fisher was motivated by agriculture
and Tippett by industrial engineering. These methods have been used to combine
the results of independent studies in meta-analysis.

The parametric problem was first posed by Cochran (1937), who was also
motivated by an agricultural problem. For simplicity suppose that we have two
estimators $T_1$ and $T_2$ of $\theta$ from $N(\theta, \sigma_1^2)$ and $N(\theta, \sigma_2^2)$ populations, respectively. The


*U.S. Department of Agriculture, National Agricultural Statistics Service, 3251 Old Lee Highway, Fairfax, VA 22030-1504, USA. e-mail: keller-timothy@hotmail.com

†Department of Statistics, Stanford University, Sequoia Hall, 390 Serra Mall, Stanford, CA 94305-4065, USA. e-mail: iolkin@stat.stanford.edu

Keywords and phrases: meta-analysis, unbiased estimators, correlated estimators.

AMS 2000 subject classifications: 62H12, 62H10.


combined estimator 

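$$T = w_1 T_1 + w_2 T_2 \tag{1.1}$$

with

$$w_1 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}, \qquad w_2 = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} \tag{1.2}$$

is unbiased and has variance

$$\operatorname{Var}(T) = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} \le \min(\sigma_1^2, \sigma_2^2). \tag{1.3}$$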

Consequently, the combined estimator dominates either single estimator in terms 
of having a smaller variance. 
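
The combination in (1.1)-(1.3) can be sketched in a few lines. This is a minimal illustration that assumes the two variances are known and the estimators are independent; the function name `combine_two` and the numerical inputs are hypothetical and do not come from the paper.

```python
def combine_two(t1, t2, var1, var2):
    """Inverse-variance combination of two independent unbiased estimates.

    Implements (1.1)-(1.3): each estimate is weighted by the variance of
    the other one, so the more precise estimate gets the larger weight.
    """
    w1 = var2 / (var1 + var2)
    w2 = var1 / (var1 + var2)
    combined = w1 * t1 + w2 * t2
    combined_var = var1 * var2 / (var1 + var2)
    return combined, combined_var, (w1, w2)


# Illustrative inputs only: T1 = 10 with variance 4, T2 = 12 with variance 1.
estimate, variance, weights = combine_two(10.0, 12.0, 4.0, 1.0)
print(estimate, variance, weights)
```

With these inputs the combined variance is 0.8, below both 4 and 1, which is the dominance property stated in (1.3).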

In practice the variances are unknown, and estimates $\hat\sigma_1^2, \hat\sigma_2^2$, independent of
$T_1, T_2$, are substituted in $w_1$ and $w_2$, that is,

$$T^* = \hat w_1 T_1 + \hat w_2 T_2. \tag{1.4}$$

Of course, $T^*$ no longer has minimum variance, but it is unbiased.

Cochran's paper was the genesis for a sequence of papers studying the effect of
using estimates of the variances. We briefly describe these in chronological order.
Graybill and Deal (1959) started with the Cochran model and assumed that the
estimators $\hat\sigma_1^2$ and $\hat\sigma_2^2$ are independent and that each arises from a sample of size
larger than 9. Under this condition, they show that $T^*$ is uniformly better than
either $T_1$ or $T_2$, where better means smaller variance.

Zacks (1966) starts with the assumption that the ratio $\rho = \sigma_2^2/\sigma_1^2$ is unknown
but is estimable, and creates an estimator

$$T^{(1)} = (\hat\rho\, T_1 + T_2)/(\hat\rho + 1), \tag{1.5}$$

where $\hat\rho$ is independent of $T_1$ and $T_2$. Then $T^{(1)}$ is unbiased. The efficiency of $T^{(1)}$
cannot be given in closed form, and Zacks (1966) provides graphs of the efficiency
relative to the estimator with $\rho$ replacing $\hat\rho$.

Seshadri (1974), motivated by balanced incomplete block (BIB) design considerations,
assumes that there is an unbiased estimator $\hat b$ of the ratio $b = \sigma_1^2/(\sigma_1^2 + \sigma_2^2)$,
independent of $T_1$ and $T_2$. Then the estimator

$$T^{(2)} = (1 - \hat b)\,T_1 + \hat b\,T_2 \tag{1.6}$$

is unbiased, and $\operatorname{Var} T^{(2)} < \min(\operatorname{Var} T_1, \operatorname{Var} T_2)$ provided $\operatorname{Var} \hat b < b^2$ and
$\operatorname{Var}(1 - \hat b) < (1 - b)^2$. The key point is that in certain BIB designs there is an
intra-block and inter-block estimator, and also an estimator $\hat b$.

When the sample sizes of the two samples are equal to $n$, Cohen and Sackrowitz
(1974) discuss estimators of the form

$$T^{(3)} = \hat\alpha_1 T_1 + \hat\alpha_2 T_2, \tag{1.7}$$

where the $\hat\alpha_i$ are functions of the sample variances and are chosen with respect to a squared
error loss function normalized by $\sigma_1^2$. They determine the sample size $n$ for which
$T^{(3)}$ is superior to either $T_1$ or $T_2$.

Because the estimators $T_i$ of the mean and $s_i^2$ of the variances are location and
scale estimators, Cohen (1974) considers a location-scale family as a more general
construct than the normal family. Again, the combined estimator is

$$T^{(4)} = \hat b_1 T_1 + \hat b_2 T_2, \qquad \hat b_1 + \hat b_2 = 1, \tag{1.8}$$

where now $\hat b_2 = c\,\hat\sigma_1^2/(\hat\sigma_1^2 + \hat\sigma_2^2)$, $c$ is a suitably chosen constant, and $\hat\sigma_1^2$ and $\hat\sigma_2^2$
are appropriately chosen estimators.

The extension from combining two estimators to combining $k$ estimators from $k$
normal populations $N(\theta, \sigma_i^2)$, $i = 1, \ldots, k$, is discussed by Norwood and Hinkelmann
(1977). Here

$$T^{(5)} = \hat w_1 T_1 + \cdots + \hat w_k T_k \tag{1.9}$$

with $\hat w_i = \hat\sigma_i^{-2} / \sum_{j=1}^{k} \hat\sigma_j^{-2}$. They show that $\operatorname{Var}(T^{(5)}) < \min_i \operatorname{Var} T_i$ if each sample
size is greater than 9, or if some sample size is equal to 9, and the others are greater
than 17.

For the case $k = 2$ Nair (1980) computes the variance of $T^*$ as an infinite series,
as a function of two parameters, $\sigma_1^2$ and $\alpha = n_1\sigma_1^2/(n_2\sigma_2^2)$. Of course, it is symmetric
and can be restated as a function of $\sigma_2^2$ and $1/\alpha$.

Following the formulation of Cohen and Sackrowitz (1974), Kubokawa (1987)
provides a family of minimax estimators under normalized quadratic loss functions.
Green and Strawderman (1991) also consider quadratic loss and provide a James-Stein
shrinkage estimator. The use of a quadratic loss function is extended to the
multivariate case by Loh (1991), where now we have normal populations $N(\theta, \Sigma_1)$
and $N(\theta, \Sigma_2)$. As in the univariate case, there are estimators $\hat\theta_1, \hat\theta_2$ of the mean
vectors and independent covariance matrix estimators $S_1, S_2$, each having a Wishart
distribution. For the loss function

$$L(\hat\theta, \theta, \Sigma_1, \Sigma_2) = (\hat\theta - \theta)'(\Sigma_1^{-1} + \Sigma_2^{-1})(\hat\theta - \theta), \tag{1.10}$$

with $\Sigma_1$ and $\Sigma_2$ known the estimator

$$\hat\theta = (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}(\Sigma_1^{-1}\hat\theta_1 + \Sigma_2^{-1}\hat\theta_2) \tag{1.11}$$

is shown to be best linear unbiased.

The model that we here consider is that there are $k$ normal populations $N(\theta, \sigma_i^2)$,
$i = 1, \ldots, k$. This model was considered by Halperin (1961) who provided an extensive
analysis in which the estimator of $\theta$ is a weighted combination of the individual
means, which are permitted to be correlated. For this model Halperin (1961) obtains
the same variance as given in (2.8) below. In the present analysis the estimator of $\theta$
is a weighted combination of any unbiased estimators, and thereby permits somewhat
more flexibility. Our derivation makes use of invariance arguments. In a later
paper, Krishnamoorthy and Rohatgi (1990) show that the simple arithmetic mean
is dominated by a shrinkage estimator that takes advantage of the variances.

2. The correlated case 

As our starting point suppose that the data available are $k$ unbiased estimators
$T_1, \ldots, T_k$ of $\theta$. However, the vector $T = (T_1, \ldots, T_k)$ has covariance matrix $\Sigma$,
for which there is a sample covariance matrix $S$ having a Wishart distribution
$W(\Sigma; k, n)$. Furthermore, $S$ and $(T_1, \ldots, T_k)$ are independent.

When $\Sigma$ is known, the linear estimator

$$\hat\theta = w_1 T_1 + \cdots + w_k T_k, \qquad w_1 + \cdots + w_k = 1, \tag{2.1}$$

with $w_i$, $i = 1, \ldots, k$, fixed is unbiased. Let $w = (w_1, \ldots, w_k)'$ and $e = (1, \ldots, 1)'$.
For the choice

$$w' = (e'\Sigma^{-1})/(e'\Sigma^{-1}e), \tag{2.2}$$



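$\hat\theta$ is also minimum variance unbiased. Furthermore,

$$\operatorname{Var}(\hat\theta) = \frac{e'\Sigma^{-1}e}{(e'\Sigma^{-1}e)^2} = \frac{1}{e'\Sigma^{-1}e}. \tag{2.3}$$

That $\operatorname{Var}(\hat\theta)$ is minimum variance follows from the Cauchy-Schwarz inequality:

$$(w'\Sigma w)(e'\Sigma^{-1}e) \ge (w'e)^2 = 1 \tag{2.4}$$

with equality if and only if (2.2) holds. Also,

$$(e'\Sigma^{-1}e)^{-1} \le \min\{\sigma_1^2, \ldots, \sigma_k^2\}, \tag{2.5}$$

which follows from (2.4) with $w = e_i = (0, \ldots, 0, 1, 0, \ldots, 0)'$.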
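
The weights (2.2) and the variance (2.3) only require solving the linear system $\Sigma x = e$. The sketch below does this with a small Gauss-Jordan routine so that it needs nothing beyond the standard library; the helper names `solve` and `optimal_weights` and the 3 x 3 covariance matrix are illustrative assumptions, not taken from the paper.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def optimal_weights(sigma):
    """Return (w, var) with w = Sigma^{-1} e / (e' Sigma^{-1} e), var = 1 / (e' Sigma^{-1} e)."""
    x = solve(sigma, [1.0] * len(sigma))  # x = Sigma^{-1} e
    total = sum(x)                        # e' Sigma^{-1} e
    return [xi / total for xi in x], 1.0 / total


# Hypothetical covariance matrix: two correlated estimators and one independent one.
sigma = [[4.0, 1.0, 0.0],
         [1.0, 2.0, 0.0],
         [0.0, 0.0, 1.0]]
w, var = optimal_weights(sigma)
print(w, var)
```

For this matrix the weights are (1/11, 3/11, 7/11) and the variance is 7/11, which is below the smallest diagonal entry, as (2.5) requires.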

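When $\Sigma$ is unknown it is estimated by $S$, and we consider the candidate estimator

$$\tilde\theta = (e'S^{-1}T)/(e'S^{-1}e). \tag{2.6}$$

The estimator $\tilde\theta$ is unbiased and has variance

$$\operatorname{Var}(\tilde\theta) = E_S E_T\left[\frac{e'S^{-1}(T - \theta e)(T - \theta e)'S^{-1}e}{(e'S^{-1}e)^2}\right] = E_S\left[\frac{e'S^{-1}\Sigma S^{-1}e}{(e'S^{-1}e)^2}\right]. \tag{2.7}$$

In the next section we provide a proof of the basic result:

$$\operatorname{Var}(\tilde\theta) = \frac{n-1}{n-k}\operatorname{Var}(\hat\theta). \tag{2.8}$$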

3. Proof of the main result 
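The Wishart density of $S$ is

$$f(S) = C(k,n)\,|\Sigma|^{-n/2}\,|S|^{(n-k-1)/2}\exp\left(-\tfrac12\operatorname{tr}\Sigma^{-1}S\right), \qquad S > 0, \tag{3.1}$$

where

$$C(k,n) = \left\{2^{nk/2}\,\pi^{k(k-1)/4}\prod_{i=1}^{k}\Gamma\left(\frac{n-i+1}{2}\right)\right\}^{-1}$$

and $\Sigma > 0$ (that is, $\Sigma$ is positive definite).

Let $Y = \Sigma^{-1/2} S \Sigma^{-1/2}$, so that the density of $Y$ is

$$f(Y) = C(k,n)\,|Y|^{(n-k-1)/2}\exp\left(-\tfrac12\operatorname{tr}Y\right), \qquad Y > 0. \tag{3.2}$$

With $b = \Sigma^{-1/2}e$,

$$\operatorname{Var}(\tilde\theta) = E\left[\frac{b'Y^{-2}b}{(b'Y^{-1}b)^2}\right]. \tag{3.3}$$

Because the density (3.2) is orthogonally invariant, that is, $f(GYG') = f(Y)$
for any orthogonal matrix $G$, a judicious choice of $G$ allows one to put (3.3) in a
more convenient form. Let $e_1 = (1, 0, \ldots, 0)'$, and choose $G$ so that the first row
of $G$ is $b'/\sqrt{b'b}$ and the remaining $k - 1$ rows of $G$ complete an orthonormal basis
for $R^k$. Then, by construction, $Gb = \sqrt{b'b}\,e_1$. Consequently, with $Z = GYG'$, (3.3)
becomes

$$\operatorname{Var}(\tilde\theta) = \frac{1}{b'b}\,E\left[\frac{e_1'Z^{-2}e_1}{(e_1'Z^{-1}e_1)^2}\right].$$

Note that $b'b = e'\Sigma^{-1}e$, and recall that $\operatorname{Var}(\hat\theta) = (e'\Sigma^{-1}e)^{-1}$, so that

$$\operatorname{Var}(\tilde\theta) = E\left[\frac{e_1'Z^{-2}e_1}{(e_1'Z^{-1}e_1)^2}\right]\operatorname{Var}(\hat\theta). \tag{3.4}$$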

Remark. For any vector $a$ of unit length, and a positive definite matrix $B$, $a'B^2a \ge (a'Ba)^2$.
Hence (3.4) demonstrates that $\operatorname{Var}(\tilde\theta) \ge \operatorname{Var}(\hat\theta)$ under the hypothesis that
$S$ and $T = (T_1, \ldots, T_k)'$ are independent, but with no distributional assumptions
on $S$ or $T$.


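Now the task of proving the theorem is reduced to computing the expectation
on the right side of equation (3.4). Towards that end, partition the $k \times k$ matrix $Z$
and its inverse as

$$Z = \begin{pmatrix} z_{11} & z_1' \\ z_1 & Z_{22} \end{pmatrix}, \qquad Z^{-1} = \begin{pmatrix} z^{11} & z^{1\prime} \\ z^{1} & Z^{22} \end{pmatrix},$$

where $Z_{22}$ and $Z^{22}$ are both $(k - 1) \times (k - 1)$.

In what follows we make use of well-known relationships between the blocks of
$Z$ and $Z^{-1}$. (See, for instance, Anderson, 2003.) Employing these relationships, and
that $(I - uu')^{-1} = I + uu'/(1 - u'u)$, the expression inside the expectation brackets in (3.4)
can be written as:

$$\frac{e_1'Z^{-2}e_1}{(e_1'Z^{-1}e_1)^2} = \frac{(z^{11})^2 + z^{1\prime}z^{1}}{(z^{11})^2} = 1 + z_{11}\,u'Z_{22}^{-1}u, \tag{3.5}$$

where $u = Z_{22}^{-1/2}z_1/\sqrt{z_{11}}$; then (3.4) becomes:

$$\operatorname{Var}(\tilde\theta) = \left[1 + E\left(z_{11}\,u'Z_{22}^{-1}u\right)\right]\operatorname{Var}(\hat\theta). \tag{3.6}$$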

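The density of $Z$ has the form (3.2), which can be written as

$$f(Z_{22}, z_{11}, u) = C(k,n)\,|Z_{22}|^{(n-k)/2}\exp\left(-\tfrac12\operatorname{tr}Z_{22}\right)\,z_{11}^{(n-2)/2}\exp\left(-\tfrac12 z_{11}\right)\,(1 - u'u)^{(n-k-1)/2}. \tag{3.7}$$

Again, using orthogonal invariance, the expectation in (3.6) is

$$E\left[z_{11}\,u'Z_{22}^{-1}u\right] = C(k,n)\,I_1 I_2 I_3, \tag{3.8}$$

where

$$I_1 = \int_0^\infty z_{11}^{n/2}\exp\left(-\tfrac12 z_{11}\right)dz_{11},$$

$$I_2 = \int_{u'u < 1} (u'u)\,(1 - u'u)^{(n-k-1)/2}\,du,$$

$$I_3 = \int_{Z_{22} > 0} (Z_{22}^{-1})_{11}\,|Z_{22}|^{(n-k)/2}\exp\left(-\tfrac12\operatorname{tr}Z_{22}\right)dZ_{22}.$$

The integral $I_2$ can be evaluated using polar coordinates; it is also a Dirichlet
Integral of Type-I. (See Sobel, Uppuluri and Frankowski, 1977.) To simplify notation
in $I_3$ let $Q = Z_{22}$, so that $Q$ is a $(k-1) \times (k-1)$ matrix having a Wishart distribution
$W(I; k-1, n)$. Then $I_3 = E(Q^{-1})_{11}/C(k-1, n)$. But this expectation is known (see
e.g. Kshirsagar, 1978, p. 72) so that

$$I_3 = \left[(n-k)\,C(k-1,n)\right]^{-1}. \tag{3.9}$$

Combining these results we obtain

$$\operatorname{Var}(\tilde\theta) = \left(1 + C(k,n)\,I_1 I_2 I_3\right)\operatorname{Var}(\hat\theta) = \frac{n-1}{n-k}\operatorname{Var}(\hat\theta). \tag{3.10}$$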
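
A quick Monte Carlo check of the inflation factor is possible for $k = 2$, where the ratio inside the expectation in (3.4) reduces by (3.5) to $1 + z_{12}^2/z_{22}^2$. The sketch below is only a sanity check under that assumption: it draws $Z$ as a sum of $n$ outer products of standard normal pairs (a $W(I; 2, n)$ matrix), and the function names, seed, and replication count are illustrative choices.

```python
import random


def inflation_ratio(n, rng):
    """One draw of e1' Z^{-2} e1 / (e1' Z^{-1} e1)^2 for Z ~ W(I; 2, n)."""
    z12 = 0.0
    z22 = 0.0
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        z12 += x1 * x2
        z22 += x2 * x2
    # For a 2 x 2 matrix the ratio equals 1 + (z12 / z22)^2.
    return 1.0 + (z12 / z22) ** 2


def simulate(n, reps=20000, seed=0):
    """Average the ratio over many simulated Wishart matrices."""
    rng = random.Random(seed)
    return sum(inflation_ratio(n, rng) for _ in range(reps)) / reps


n = 10
estimate = simulate(n)
exact = (n - 1) / (n - 2)  # (n - 1) / (n - k) with k = 2
print(estimate, exact)
```

The simulated mean should sit close to $(n-1)/(n-k) = 1.125$; the agreement is statistical, so small deviations are expected.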


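4. Discussion of efficiency for k = 2 and n = N - 1

The result that $\operatorname{Var}(\tilde\theta) = \frac{n-1}{n-k}\operatorname{Var}(\hat\theta)$ coincides with what intuition suggests:
when $k = 1$, $\operatorname{Var}(\tilde\theta) = \operatorname{Var}(\hat\theta)$; when $k > 1$, $\operatorname{Var}(\tilde\theta) > \operatorname{Var}(\hat\theta)$, and for all $k$,
$\lim_{N \to \infty}\operatorname{Var}(\tilde\theta) = \operatorname{Var}(\hat\theta)$. But the result gives more precise information that
helps one to assess the efficiency of the Graybill-Deal estimator for a given sample
size.

Consider the case $k = 2$, $n = N - 1$. If, without loss of generality, we take $\sigma_{11} =
\min\{\sigma_{11}, \sigma_{22}\}$, then $\operatorname{Var}(\tilde\theta) \le \min\{\sigma_{11}, \sigma_{22}\}$ when

$$\frac{1}{N-3} \le \frac{(\sigma_{11} - \sigma_{12})^2}{\sigma_{11}\sigma_{22} - \sigma_{12}^2}. \tag{4.1}$$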


In the special case for which $\operatorname{cov}(T_1, T_2) = 0$, (4.1) is $1/(N - 3) \le \sigma_{11}/\sigma_{22} \le 1$,
which implies that $\operatorname{Var}(\tilde\theta) \le \min(\sigma_{11}, \sigma_{22})$ for all $N > 5$. Note that this does not
contradict the previously quoted result of Graybill and Deal (1959); their hypothesis
allows $N_1$ and $N_2$, the sample sizes for the respective constituent estimators, to be
unequal; whereas the current theorem was derived under the assumption that $N_1 =
N_2 = N$. When $T_1$ and $T_2$ are uncorrelated, there are corresponding sample sizes
$N_1$ and $N_2$ used in estimating the variances. However, when the $T$'s are correlated,
the covariance matrix is estimated from a single sample of size $N$.

Writing $\sigma_{11} = \alpha^2\sigma_{22}$, $0 < \alpha \le 1$, and denoting the correlation between $T_1$ and
$T_2$ by $\rho$, (4.1) can be written as

$$\frac{1}{N-3} \le \frac{(\alpha - \rho)^2}{1 - \rho^2}. \tag{4.2}$$

Then it is apparent that if one varies the parameters $\alpha$ and $\rho$ so that $\alpha - \rho \to 0$,
the sample size $N$ necessary for (4.2) to hold increases without bound. But this
also is intuitive: $\alpha - \rho \to 0$ is equivalent to $\hat\theta \to T_1$. Given a rough initial estimate
for the parameters $\alpha$ and $\rho$, one may use (4.2) to obtain some idea whether the
Graybill-Deal estimator dominates the better of the two constituent estimators for
a given sample size.
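
Inequality (4.2) can be turned into a rough planning rule: solve it for the smallest integer $N$. The helper below is a sketch under the assumptions of this section ($k = 2$, one sample of size $N$ for the covariance matrix); the name `min_sample_size` is hypothetical, and the inputs are example values rather than estimates from any data set.

```python
import math


def min_sample_size(alpha, rho):
    """Smallest integer N with 1/(N - 3) <= (alpha - rho)^2 / (1 - rho^2).

    alpha is the ratio of standard deviations (0 < alpha <= 1) and rho the
    correlation between the two estimators. Returns math.inf when
    alpha == rho, where no finite N satisfies (4.2).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie in (-1, 1)")
    gap = (alpha - rho) ** 2
    if gap == 0.0:
        return math.inf
    return math.ceil(3.0 + (1.0 - rho ** 2) / gap)


print(min_sample_size(1.0, 0.0))  # equal variances, uncorrelated
print(min_sample_size(1.0, 0.5))  # equal variances, correlation 0.5
print(min_sample_size(0.5, 0.5))  # alpha - rho = 0
```

The three calls give 4, 6, and infinity, matching the remark that the required sample size grows without bound as $\alpha - \rho \to 0$.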

Taking the special case $\sigma_{11} = \sigma_{22}$, (4.2) becomes

$$\frac{1}{N-3} \le \frac{1 - \rho}{1 + \rho}.$$

This form of equation (4.1) implies that the sample size for (4.1) to hold increases
without bound as $\rho \to 1$. Once again, this is intuitive: to say $\rho$ is close to 1 means
the estimator $T_2$ provides essentially the same information about $\theta$ as the estimator
$T_1$, and hence the composite estimator cannot be expected to provide much more
information than that provided by $T_1$ alone.


5. An agricultural application: Forecasting yield

The National Agricultural Statistics Service (NASS), an agency of the USDA, makes
monthly pre-harvest yield forecasts for the major US agricultural commodities at
several geographic levels. In the final analysis, the official forecast of yield announced
to the public is necessarily the result of a mixed process of both objective scientific
technique and subjective expert judgment. Nevertheless, subjective expert judgement
is most effective when it has an objective estimate of yield with which to
commence its operation. Given an historical data series for the most important
estimators of yield, and the corresponding published final yield, one can estimate the
covariance structure and biases for those estimators. These are then the basis for
computing a composite estimate of yield. The question of how best to use historical
data to estimate the biases in the constituent estimators of yield is important in
itself. In order to avoid a long digression, we pick up the discussion of the application
at the point where a ‘bias correction’ has already been applied to the historical
data: hence only the problem of estimating the covariance matrix remains.

Table 1 presents the predicted yield based on a biological yield model ($T_1$) and
the predicted yield based on a survey of producer expectations ($T_2$). These data
have been masked for security considerations. Make the following assumptions:

(1) The true yield $\theta_i$ for year $i$ is the yield published by NASS (Table 2) at the
end of the growing season.

(2) Ti and T 2 are independent. 

(3) The covariance matrix is essentially constant over time.

Under these assumptions the maximum likelihood estimator for the covariance matrix
based on the data in Table 1 is

$$\hat\Sigma = \begin{pmatrix} 9.50 & 2.19 \\ 2.19 & 15.30 \end{pmatrix},$$

and the vector of weights for the linear combination of $T_1$ and $T_2$ which is the
Graybill-Deal estimator of yield is $w' = (0.642, 0.358)$.


Table 1: Predicted yields (weight per area) of commodity Z for state X in month Y.

Year    Survey    Biological yield
  1      88.0          87.5
  2      82.5          80.0
  3      83.0          86.5
  4      73.5          79.0
  5      79.0          84.5
  6      82.0          83.5
  7      83.0          79.8
  8      80.8          84.0
  9      81.0          83.0
 10      79.0          79.0
 11      64.0          76.0
 12      80.5          83.8
 13      83.0          87.0
 14      81.5          78.5
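
The covariance estimate and the weights can be reproduced from the tabulated values. The sketch below makes two assumptions that the text does not spell out: forecast errors are measured against the final published yields of Table 2, and the sums of squares and cross-products are divided by the number of years (14) without centering. The variable and function names are illustrative.

```python
# First and second forecast columns of Table 1, and final yields from Table 2.
first = [88.0, 82.5, 83.0, 73.5, 79.0, 82.0, 83.0,
         80.8, 81.0, 79.0, 64.0, 80.5, 83.0, 81.5]
second = [87.5, 80.0, 86.5, 79.0, 84.5, 83.5, 79.8,
          84.0, 83.0, 79.0, 76.0, 83.8, 87.0, 78.5]
final = [87.8, 87.3, 85.3, 76.8, 78.3, 89.0, 82.5,
         84.0, 82.3, 80.8, 68.3, 83.0, 85.0, 81.8]


def error_covariance(t1, t2, truth):
    """Uncentered second moments of the forecast errors, divided by N."""
    n = len(truth)
    d1 = [a - c for a, c in zip(t1, truth)]
    d2 = [b - c for b, c in zip(t2, truth)]
    s11 = sum(x * x for x in d1) / n
    s22 = sum(y * y for y in d2) / n
    s12 = sum(x * y for x, y in zip(d1, d2)) / n
    return s11, s12, s22


def weights_2x2(s11, s12, s22):
    """w = S^{-1} e / (e' S^{-1} e) written out for a 2 x 2 matrix."""
    denom = s11 + s22 - 2.0 * s12
    return (s22 - s12) / denom, (s11 - s12) / denom


s11, s12, s22 = error_covariance(first, second, final)
w1, w2 = weights_2x2(s11, s12, s22)
print(round(s11, 2), round(s12, 2), round(s22, 2))
print(round(w1, 3), round(w2, 3))
```

Under these assumptions the output agrees with the printed matrix (9.50, 2.19, 15.30) and with the weights (0.642, 0.358).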

A word about the operational implementation of these ideas is in order. It
is unreasonable to expect that the assumptions underlying the estimate of the
covariance matrix hold for all time; hence, in practice, one envisions that yield data
from a ‘moving window’ of N past years would be used to estimate the vector of
coefficients, w, used to compute the composite estimate of yield for the current
year. This concept has been tested by a cross-validation scheme in which each of
N + 1 years is sequentially treated as the ‘current’ year, and the remaining N years
are treated as the ‘past’, where N + 1 is the length of the relevant data series which
is available; but, for the sake of a simple exposition, the calculations presented in
Table 2 are based on all 14 years of data at once, the results of the cross-validation
scheme being very similar.
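
The cross-validation scheme described above can be sketched as a leave-one-year-out loop. This is one possible reading of the description, not the authors' code: the weights for each ‘current’ year are computed from the forecast errors of the other 13 years (errors taken against the final yields of Table 2, uncentered, divided by the number of past years), and all names are illustrative.

```python
first = [88.0, 82.5, 83.0, 73.5, 79.0, 82.0, 83.0,
         80.8, 81.0, 79.0, 64.0, 80.5, 83.0, 81.5]
second = [87.5, 80.0, 86.5, 79.0, 84.5, 83.5, 79.8,
          84.0, 83.0, 79.0, 76.0, 83.8, 87.0, 78.5]
final = [87.8, 87.3, 85.3, 76.8, 78.3, 89.0, 82.5,
         84.0, 82.3, 80.8, 68.3, 83.0, 85.0, 81.8]


def weights_from_past(idx):
    """Graybill-Deal weights from the forecast errors of the years in idx."""
    d1 = [first[i] - final[i] for i in idx]
    d2 = [second[i] - final[i] for i in idx]
    s11 = sum(x * x for x in d1) / len(idx)
    s22 = sum(y * y for y in d2) / len(idx)
    s12 = sum(x * y for x, y in zip(d1, d2)) / len(idx)
    w1 = (s22 - s12) / (s11 + s22 - 2.0 * s12)
    return w1, 1.0 - w1


def leave_one_out():
    """Composite forecast for each year using weights from the other years."""
    forecasts = []
    for current in range(len(final)):
        past = [i for i in range(len(final)) if i != current]
        w1, w2 = weights_from_past(past)
        forecasts.append(w1 * first[current] + w2 * second[current])
    return forecasts


loo = leave_one_out()
loo_rmse = (sum((f - y) ** 2 for f, y in zip(loo, final)) / len(final)) ** 0.5
print(round(loo_rmse, 2))
```

Because the held-out year never contributes to its own weights, the resulting root mean square error is an out-of-sample figure that can be set beside the in-sample values of Table 2.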

Looking at Table 2, one notes that the root mean square error for the composite
estimator was less than that of either constituent estimator of yield, and only
slightly larger than the root mean square error for the yield forecast produced by
the panel of commodity experts. Given that this panel was privy to a great many
sources of information relevant to setting yield, in addition to the constituent
estimators of yield, this is a remarkable result. One cannot hope to replace expert
judgement with statistical methodology; nevertheless, these results demonstrate
that the techniques of composite estimation can provide a useful starting point for
the overall process of setting a yield forecast.


Table 2:

Year    Composite Estimate ($\tilde\theta$)    Panel of Experts    Final Published Yield ($\theta$)
  1               87.8                       89.5                   87.8
  2               81.5                       82.5                   87.3
  3               84.2                       85.8                   85.3
  4               75.3                       76.3                   76.8
  5               81.3                       83.3                   78.3
  6               82.5                       83.8                   89.0
  7               81.8                       85.0                   82.5
  8               81.8                       81.3                   84.0
  9               81.7                       81.8                   82.3
 10               79.0                       81.0                   80.8
 11               68.3                       67.5                   68.3
 12               81.6                       83.0                   83.0
 13               84.4                       85.0                   85.0
 14               80.4                       82.0                   81.8

Root Mean Square Error:
Farmer Reported Yield     3.06
Biological Yield Model    3.92
Composite Estimator       2.68
Panel of Experts          2.58
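
The root mean square errors for the composite estimate and the expert panel can be recomputed from the columns of Table 2. The sketch assumes the error is taken against the final published yield and averaged over all 14 years; the helper name `rmse` is illustrative.

```python
import math

composite = [87.8, 81.5, 84.2, 75.3, 81.3, 82.5, 81.8,
             81.8, 81.7, 79.0, 68.3, 81.6, 84.4, 80.4]
panel = [89.5, 82.5, 85.8, 76.3, 83.3, 83.8, 85.0,
         81.3, 81.8, 81.0, 67.5, 83.0, 85.0, 82.0]
final = [87.8, 87.3, 85.3, 76.8, 78.3, 89.0, 82.5,
         84.0, 82.3, 80.8, 68.3, 83.0, 85.0, 81.8]


def rmse(forecast, truth):
    """Root mean square error of a forecast series against the truth."""
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(forecast, truth)) / len(truth))


rmse_composite = rmse(composite, final)
rmse_panel = rmse(panel, final)
print(round(rmse_composite, 2), round(rmse_panel, 2))
```

With these assumptions the panel value rounds to the tabulated 2.58, and the composite value comes out within about 0.01 of the tabulated 2.68; the small gap is plausibly due to rounding of the tabulated composite estimates.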

Acknowledgement 

The authors are grateful to Brian Cullis for many helpful comments and insights
relating to the paper by Halperin (1961), and to the referee for helpful suggestions.

References 

[1] Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis,
3rd edition. John Wiley and Sons, New York. MR1990662

[2] Cochran, W. G. (1937). Problems arising in the analysis of a series of similar
experiments, Supplement to the Journal of the Royal Statistical Society 4 102-118.

[3] Cohen, A. (1976). Combining estimates of location, Journal of the American
Statistical Association 71 172-175. MR426258

[4] Cohen, A. and Sackrowitz, H. B. (1974). On estimating the common mean of
two normal distributions. Annals of Statistics 2 1274-1282. MR365851

[5] Fisher, R. A. (1932). Statistical methods for research workers (4th ed.) Oliver
and Boyd, London.

[6] Graybill, F. A. and Deal, R. B. (1959). Combining unbiased estimators,
Biometrics 15 543-550. MR107925

[7] Green, E. J. and Strawderman, W. E. (1991). A James-Stein type estimator for
combining unbiased and possibly biased estimators. Journal of the American
Statistical Association 86 1001-1006. MR1146348

[8] Halperin, M. (1961). Almost linearly-optimum combination of unbiased
estimates. Journal of the American Statistical Association 56 36-43. MR141181

[9] Krishnamoorthy, K. and Rohatgi, V. K. (1990). Unbiased estimation of the
common mean of a multivariate normal distribution. Communications in
Statistics - Theory and Methods 19 (5) 1803-1810. MR1075503

[10] Kshirsagar, A. (1978). Multivariate Analysis, Marcel Dekker, Inc., New York.
MR343478

[11] Kubokawa, T. (1987). Admissible minimax estimation of a common mean of
two normal populations. Annals of Statistics 15 1245-1256. MR902256

[12] Loh, W.-L. (1991). Estimating the common mean of two multivariate normal
distributions, Annals of Statistics 19 297-313. MR1091852

[13] Nair, K. A. (1980). Variance and distribution of the Graybill-Deal estimator of
the common mean of two normal populations, Annals of Statistics 8 212-216.
MR557567

[14] Norwood, T. E. and Hinkelmann, K. Jr. (1977). Estimating the common mean
of several normal populations, Annals of Statistics 5 1047-1050. MR148679

[15] Raj, D. (1998). Sampling Theory, McGraw-Hill, New York. MR230440


[16] Seshadri, V. (1963). Constructing uniformly better estimators. Journal of the
American Statistical Association 58 172-175. MR145628

[17] Sobel, M., Uppuluri, V. R. R., and Frankowski, K. (1977). Selected tables in
mathematical statistics, Vol. 4: Dirichlet Distribution - Type 1, American
Mathematical Society, Providence, Rhode Island. MR423747

[18] Tippett, L. H. C. (1931). The method of statistics. Williams and Norgate,
London.

[19] Zacks, S. (1966). Unbiased estimation of the common mean of two normal
distributions based on small samples of equal size. Journal of the American
Statistical Association 61 467-476. MR207100


A Festschrift for Herman Rubin
Institute of Mathematical Statistics
Lecture Notes-Monograph Series
Vol. 45 (2004) 228-236
© Institute of Mathematical Statistics, 2004

An asymptotic minimax determination of
the initial sample size in a two-stage
sequential procedure

Abstract: When estimating the mean of a normal distribution with squared
error loss and a cost for each observation, the optimal (fixed) sample size
depends on the variance $\sigma^2$. A two-stage sequential procedure is to first conduct
a pilot study from which $\sigma^2$ can be estimated, and then estimate the desired
sample size. Here an asymptotic formula for the initial sample size in a two-stage
sequential estimation procedure is derived, asymptotic as the cost of a
single observation becomes small compared to the loss from estimation error.
The experimenter must specify only the sample size, $n_0$ say, that would be
used in a fixed sample size experiment; the initial sample size of the two-stage
procedure is then the least integer greater than or equal to $\sqrt{n_0/2}$. The
resulting procedure is shown to minimize the maximum Bayes regret, where
the maximum is taken over prior distributions that are consistent with the
initial guess $n_0$; and the minimax solution is shown to provide an asymptotic
lower bound for optimal Bayesian choices for a wide class of prior distributions.


1. Introduction 

It is indeed a pleasure to offer this tribute to Herman Rubin and to ponder his
influence on my own work. I still remember the interest with which I read the
papers on Bayes’ risk efficiency [7] and [8] early in my career. From reading these
papers, I gained an appreciation for the power of statistical decision theory and its
interplay with asymptotic calculations that go beyond limiting distributions. These
involved moderate deviations in the case of [7]. A central idea in [8] is the study of a
risk functional, the integrated risk of a procedure with respect to a prior distribution
that can vary over a large class. I have used this idea in a modified form in work
on sequential point estimation and very weak expansions for sequential confidence
intervals, [12, 13, 14], and the references given there. This idea is also present in
Theorem 2 below. The connection between [12] and Bayes risk efficiency is notable
here. The following is proved in [12], though not isolated: Suppose that it is required
to estimate the mean of an exponential family with squared error loss and a cost for
each observation and that the population mean is to be estimated by the sample
mean. Then there is a stopping time which is Bayes risk non-deficient in the sense
of [4]; that is, it minimizes a Bayesian regret asymptotically, simultaneously for all
sufficiently smooth prior distributions.

The present effort combines tools from decision theory and asymptotic analysis
to obtain a rule for prescribing the initial sample size in a two-stage sequential
procedure for estimating the mean of a normal distribution. Unlike the fully
sequential, or even three-stage, versions of the problem, Bayes risk non-deficiency is
not possible with two-stage procedures, and the rule is obtained from minimaxity.
The problem is stated in Section 2, and the minimax solution is defined. The rule
requires the statistician to specify only the fixed sample size, $n_0$ say, that would
have been used in a fixed sample size design, or to elicit such from a client. The
minimax initial sample size is then the least integer that is greater than or equal to
$\sqrt{n_0/2}$. The proof of asymptotic minimaxity is provided in Section 3. As explained
in Section 4, the minimax solution is very conservative but, at least, provides an
asymptotic lower bound for optimal Bayesian solutions for a wide class of prior
distributions.

Statistics Department, University of Michigan, Ann Arbor, Michigan 48109, USA. e-mail:
michaelw@stat.lsa.umich.edu

Keywords and phrases: Bayesian solutions, integrated risk and regret, inverted gamma priors,
sequential point estimation, squared error loss.

AMS 2000 subject classifications: 62L12.


2. The problem 

Let $X_1, X_2, \ldots \sim \text{Normal}[\mu, \sigma^2]$ be independent, where $-\infty < \mu < \infty$ and $\sigma > 0$ are unknown, and
consider the problem of estimating $\mu$ with loss of the form

$$L_a(n) = a^2(\bar X_n - \mu)^2 + n, \tag{1}$$

where $\bar X_n = (X_1 + \cdots + X_n)/n$. In (1), $a^2(\bar X_n - \mu)^2$ represents the loss due to
estimation error, and $n$ the cost of sampling. The units are so chosen that each
observation costs one unit, and $a$ is determined by the importance of estimation
error relative to the cost of sampling. Also, the estimator has been specified as $\bar X_n$,
leaving only the sample size $n$ to be determined. If $\sigma$ were known, then the expected
loss plus sampling cost, $E_{\mu,\sigma^2}[L_a(n)] = a^2\sigma^2/n + n$, would be minimized when $n$ is
an integer adjacent to

$$\mathcal{N} = a\sigma,$$

and in many ways the problem is one of estimating $\mathcal{N}$. This will be done using the
sample variances

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar X_n)^2$$

for $n \ge 2$. Interest in two-stage sequential procedures for estimation originated with
Stein’s famous paper [9]. The problem has a long history, much of which is included
in Chapter 6 of [5], but there seems to be no general agreement on the choice of
the initial sample size $m$ in two-stage procedures. Some additional references are
provided in the last section.
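
The known-variance benchmark can be made concrete with a short sketch. It evaluates the expected loss $a^2\sigma^2/n + n$ and picks the better of the two integers adjacent to $a\sigma$; the function names and the numerical values of $a$ and $\sigma$ are illustrative assumptions.

```python
import math


def expected_loss(n, a, sigma):
    """Expected loss plus sampling cost, a^2 sigma^2 / n + n, for a fixed sample size n."""
    return (a * sigma) ** 2 / n + n


def best_fixed_sample_size(a, sigma):
    """Integer adjacent to a * sigma that minimizes the expected loss."""
    target = a * sigma
    candidates = {max(1, math.floor(target)), max(1, math.ceil(target))}
    return min(candidates, key=lambda n: expected_loss(n, a, sigma))


# Illustrative values: a = 50 and sigma = 2 give an optimal size of 100.
n_opt = best_fixed_sample_size(50.0, 2.0)
print(n_opt, expected_loss(n_opt, 50.0, 2.0))
```

When $a\sigma$ is itself an integer the minimum expected loss equals $2a\sigma$, here 200, which is the quantity subtracted from the risk to form the regret in the next section.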

A two-stage procedure consists of a pair $\delta = (m, N)$ where $m \ge 2$ is an integer
and $N = N(S_m^2)$ is an integer valued random variable for which $N \ge m$. The
estimator of $\mu$ is then $\bar X_N$. For example, letting $\lceil x \rceil$ denote the least integer that is
at least $x$,

$$N_a = \max\{m, \lceil a S_m \rceil\} \tag{2}$$

satisfies the conditions for any $m \ge 2$. The choice of $m$ has to be subjective at
some level, because there is no data available when it is chosen. Here it is required
only that the experimenter specify a prior guess, $u$ say, for $\sigma$, or even just the guess
$n_0 = au$ for the optimal fixed sample size $a\sigma$. This seems a very modest requirement, since a fixed sample size
experiment would have to include a prior guess for the optimal sample size. Given the prior guess, it is
shown that

$$m_a = \max\left\{2, \left\lceil \sqrt{n_0/2}\, \right\rceil\right\} \tag{3}$$

leads to a procedure that minimizes the maximum Bayes’ regret in the class of prior
distributions for which $\sigma$ has prior mean $u$.


3. The theorem

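The risk of a two-stage procedure $\delta = (m, N)$ is $R_a(\delta; \sigma^2) = E_{\mu,\sigma^2}[L_a(N)]$. Using
the Helmert transformation (for example, [11, p. 106]), it is easily seen that

$$R_a(\delta; \sigma^2) = E_{\sigma^2}\left[\frac{a^2\sigma^2}{N} + N\right], \tag{4}$$

which depends on $\sigma^2$ but not on $\mu$. The difference

$$r_a(\delta; \sigma^2) = E_{\sigma^2}\left[\frac{a^2\sigma^2}{N} + N\right] - 2a\sigma$$

is called the regret.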

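If $\xi$ is a prior distribution over $[0, \infty)$, write $P^\xi$ and $E^\xi$ for probability and expectation
in the Bayesian model, where $\sigma^2 \sim \xi$ and $S_2^2, S_3^2, \ldots$ are jointly distributed
random variables; and write $P_m^\xi$ and $E_m^\xi$ for conditional probability and expectation
given $S_m^2$. Then the integrated risk of a two-stage procedure $\delta$ with respect to
a $\xi$ is

$$\bar R_a(\delta; \xi) = \int_0^\infty R_a(\delta; \sigma^2)\,\xi\{d\sigma^2\} = E^\xi\left[\frac{a^2\sigma^2}{N} + N\right],$$

possibly infinite; and if $\int_0^\infty \sigma\,\xi\{d\sigma^2\} < \infty$, then the integrated regret of $\delta$ with
respect to $\xi$ is

$$\bar r_a(\delta; \xi) = \int_0^\infty r_a(\delta; \sigma^2)\,\xi\{d\sigma^2\} = E^\xi\left[\frac{a^2\sigma^2}{N} + N - 2a\sigma\right],$$

again possibly infinite. As noted above, the experimenter must specify $E^\xi(a\sigma)$, or
equivalently, $E^\xi(\sigma)$. In fact, it is sufficient for the experimenter to specify an upper
bound. For a fixed $u \in (0, \infty)$, let $\Xi = \Xi_u$ be the class of prior distributions for
which

$$\int_0^\infty \sigma\,\xi\{d\sigma^2\} \le u; \tag{5}$$

and let $\Xi^\circ = \Xi_u^\circ$ be the class of $\xi$ for which there is equality in (5). Also, let $\delta^a$ be
the procedure $(m_a, N_a)$ defined by (2) and (3) with $n_0 = au$.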

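Theorem 1. For any $u \in (0, \infty)$,

$$\inf_{\delta}\sup_{\xi \in \Xi} \bar r_a(\delta; \xi) \sim \sqrt{2n_0} \sim \sup_{\xi \in \Xi} \bar r_a(\delta^a; \xi)$$

as $a \to \infty$.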


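Proof. The proof will consist of showing first that

$$\limsup_{a \to \infty}\sup_{\xi \in \Xi}\frac{1}{\sqrt a}\,\bar r_a(\delta^a; \xi) \le \sqrt{2u} \tag{6}$$

and then that

$$\liminf_{a \to \infty}\sup_{\xi \in \Xi^\circ}\inf_{\delta}\frac{1}{\sqrt a}\,\bar r_a(\delta; \xi) \ge \sqrt{2u}. \tag{7}$$

This is sufficient, since $\inf_\delta \sup_{\xi \in \Xi} \ge \sup_{\xi \in \Xi^\circ}\inf_\delta$. In the proofs of (6) and (7),
there is no loss of generality in supposing that $u = 1$.

The Upper Bound. From (4) and (2),

$$R_a(\delta^a; \sigma^2) \le a\sigma^2 E_{\sigma^2}\left[\frac{1}{S_{m_a}}\right] + aE_{\sigma^2}(S_{m_a}) + m_a + 1. \tag{8}$$

Here

$$E_{\sigma^2}(S_m) = C(m)\sigma, \qquad \text{where} \qquad C(m) = \frac{\Gamma\left(\frac{m}{2}\right)}{\sqrt{\frac{m-1}{2}}\;\Gamma\left(\frac{m-1}{2}\right)} \tag{9}$$

and $\Gamma$ is the Gamma-function; and, similarly,

$$E_{\sigma^2}\left[\frac{1}{S_m}\right] = \sqrt{\frac{m-1}{m-2}}\;\frac{1}{C(m-1)\sigma}. \tag{10}$$

A version of Stirling’s Formula asserts that

$$\log\Gamma(z) = \left(z - \tfrac12\right)\log(z) - z + \tfrac12\log(2\pi) + \frac{1}{12z} + O\left(\frac{1}{z^3}\right)$$

as $z \to \infty$. See, for example, [1, p. 257]. It then follows from simple algebra that

$$C(m) = 1 - \frac{1}{4m} + O\left(\frac{1}{m^2}\right). \tag{11}$$

Let $a$ be so large that $m_a \ge 3$. Then, combining (8) and (11),

$$R_a(\delta^a; \sigma^2) \le a\sigma\left[\sqrt{\frac{m_a - 1}{m_a - 2}}\;\frac{1}{C(m_a - 1)} + C(m_a)\right] + m_a + 1 = 2a\sigma + \frac{a\sigma}{2m_a} + m_a + 1 + a\sigma \times O\left(\frac{1}{m_a^2}\right),$$

where $O(1/m^2)$ is a function only of $m$. So, for every $\xi \in \Xi = \Xi_1$,

$$\bar r_a(\delta^a; \xi) \le \frac{a}{2m_a} + m_a + 1 + a \times O\left(\frac{1}{m_a^2}\right) \le \sqrt{2a} + O(1),$$

establishing (6), since $n_0 = a$ when $u = 1$.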
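
The expansion (11) and the final step of the upper bound can be checked numerically. The sketch below evaluates $C(m)$ from (9) through `math.lgamma` and compares it with $1 - 1/(4m)$; it also evaluates the leading terms $a/(2m) + m + 1$ of the regret bound, deliberately dropping the $a \times O(1/m^2)$ remainder. Function names and the test values are illustrative.

```python
import math


def c_factor(m):
    """C(m) = E(S_m) / sigma for a normal sample of size m >= 2, as in (9)."""
    half = (m - 1) / 2.0
    return math.exp(math.lgamma(m / 2.0) - math.lgamma(half)) / math.sqrt(half)


def leading_regret_bound(a, m):
    """Leading terms a/(2m) + m + 1 of the bound (remainder term omitted)."""
    return a / (2.0 * m) + m + 1.0


m = 100
print(c_factor(m), 1.0 - 1.0 / (4.0 * m))

a = 200.0
m_a = max(2, math.ceil(math.sqrt(a / 2.0)))  # rule (3) with n0 = a, i.e. u = 1
print(m_a, leading_regret_bound(a, m_a), math.sqrt(2.0 * a))
```

For $m = 100$ the two values of $C(m)$ agree to about four decimal places, and for $a = 200$ the leading terms give 21 against $\sqrt{2a} = 20$, consistent with a bound of the form $\sqrt{2a} + O(1)$.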

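The Lower Bound. The lower bound (7) will be established by finding the Bayes
procedure and a lower bound for the Bayes regret

$$\bar r_a(\xi) = \inf_{\delta}\bar r_a(\delta; \xi)$$

for a general prior distribution $\xi$ and then finding priors $\xi_a \in \Xi^\circ$ for which
$\liminf_{a \to \infty}\bar r_a(\xi_a)/\sqrt a \ge \sqrt 2$.

Finding the Bayes procedure is not difficult. If the initial sample size is $m \ge 2$,
then $N$ should be chosen to minimize the posterior expected loss $E_m^\xi[a^2\sigma^2/n + n]$
with respect to $n$. Clearly,

$$E_m^\xi\left[\frac{a^2\sigma^2}{n} + n\right] = \frac{a^2 V_m}{n} + n, \tag{12}$$

where

$$V_m = E_m^\xi(\sigma^2).$$

So, (12) is minimized when $n$ is the larger of $m$ and an integer adjacent to $a\sqrt{V_m}$,
leaving

$$\bar r_a(\xi) = \inf_{m \ge 2} E^\xi\left[2a\sqrt{V_m} + \frac{1}{m}\left(m - a\sqrt{V_m}\right)_+^2 + \eta(a, m)\right] - 2a,$$

where $(x)_+^2$ denotes the square of the positive part of $x$ and $0 \le \eta(a, m) \le 1/m$.
An alternative expression is

$$\bar r_a(\xi) = \inf_{m \ge 2} E^\xi\left\{2a\left[\sqrt{V_m} - U_m\right] + \frac{1}{m}\left(m - a\sqrt{V_m}\right)_+^2 + \eta(a, m)\right\}, \tag{13}$$

where

$$U_m = E_m^\xi(\sigma)$$

and $E^\xi(U_m) = E^\xi(\sigma) = 1$.

Suppose now that $\xi$ is an inverted Gamma prior with density

$$\frac{1}{\Gamma\left(\frac{\alpha}{2}\right)}\left(\frac{\beta}{2\sigma^2}\right)^{\alpha/2}\exp\left(-\frac{\beta}{2\sigma^2}\right)\frac{1}{\sigma^2}, \tag{14}$$

where $\alpha > 1$ and $\beta > 0$. Equivalently $1/\sigma^2$ has a Gamma distribution with parameters
$\alpha/2$ and $\beta/2$. Then

$$E(\sigma) = \frac{\Gamma\left(\frac{\alpha - 1}{2}\right)}{\Gamma\left(\frac{\alpha}{2}\right)}\sqrt{\frac{\beta}{2}}. \tag{15}$$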

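Letting

$$W_m = (m - 1)S_m^2$$

and applying (15) to the posterior distributions then leads to

$$U_m = \frac{\Gamma\left(\frac{\alpha + m - 2}{2}\right)}{\Gamma\left(\frac{\alpha + m - 1}{2}\right)}\sqrt{\frac{\beta + W_m}{2}}$$

and

$$V_m = E_m^\xi(\sigma^2) = \frac{\beta + W_m}{\alpha + m - 3} = B(\alpha + m - 1)^2\,U_m^2, \tag{16}$$

where

$$B(m) = \sqrt{\frac{2}{m-2}}\;\frac{\Gamma\left(\frac{m}{2}\right)}{\Gamma\left(\frac{m-1}{2}\right)}. \tag{17}$$

In order for the $\xi$ of (14) to be in $\Xi^\circ = \Xi_1^\circ$, $\alpha$ and $\beta$ must be so constrained that
the right side of (15) equals one. Then $E^\xi(U_m) = 1$, $E^\xi(\sqrt{V_m}) = B(\alpha + m - 1)$,
and

$$\bar r_a(\xi) = \inf_{m \ge 2} E^\xi\left\{2a\left[B(\alpha + m - 1) - 1\right]U_m + \frac{1}{m}\left(m - a\sqrt{V_m}\right)_+^2 + \eta(a, m)\right\}$$

$$\ge \inf_{m \ge 2}\left\{2a\left[B(\alpha + m - 1) - 1\right] + (1 - \epsilon)^2\,m\,P^\xi\left[a\sqrt{V_m} \le \epsilon m\right]\right\}$$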

for any e > 0. 
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Ol)serve that B{a + m — 1) is positive and bounded away from 0 for 0 < a < 1 
for each fixed m > 2. It follows that the term in braces on the right side of (13) is 
of order a for each fixed m > 2 when ^ is an inverted gamma prior with 0 < a < 2 
and, therefore, tliat the minimizing m = 111 ,, approaches 00 as a 00 . So, infm >2 
in (13) can be replaced by inf,„>,„j, for any tiiq for all sufficiently large a. 

The marginal distribution of VTm is of the form 


for all 1 < ft < 2, m > 2 and c > 0. Let bo an inverted gamma prior with 
= o(a”^) and determined by the condition that E^^{a) = 1. Then — > 1 <us 
a — > 00 . If e > 0 is given, then 


for all sufficiently large a. From (11) and (17) there Is an mo for which B{rn) > 
/ + (1 — e)/4m for all m > mo — 1. Then 


4. The ininimax solution as a lower bound 

The minimax solution is very conservative in that it specifies a very small initial 
sample size. For exainjile, if the prior guess for the best fix(‘d sami)le size is 100, 
then the asymptotic minimfix solution calls for an initial sample size of only 8; and 
if the ])rior gue.ss is increast'd to 1000, then the initial sample size increases only 



where 



Clearly, 



for all c > 0. So, there is a constant K for which 




for all m > 3 and sufficiently large a. It follows that for any mo > 3, 



(18) 


r(^„)>(l-c) inf ;^ + (l-c)^m 

m>mo Z7U 


for all sufficiently large a. Relation (7) follows since c > 0 vn\s arl)itrary. □ 
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to 23. The asymptotic miniinax solution approximates the Bayes procedure when 
σ is small with high probability, but still has a fixed mean, as is clear from
the nature of the inverted gamma prior that was used to obtain the lower bound.
A statistician who can specify more about the prior distribution will take a larger
initial sample size for large a and incur a smaller regret. For example, if σ ≥ σ_0 > 0
with prior probability one, then the optimal initial sample size is at least aσ_0, and
the Bayes regret is of order one as a → ∞, assuming that σ has a finite prior mean.
A more detailed study of the asymptotic properties of Bayes procedures suggests
that optimal m is closely related to the behavior of the prior density near σ² = 0,
a relationship that might be difficult to specify or elicit from a client. The inverted
gamma priors (14) are an extreme case since the prior density approaches zero very
rapidly as σ² → 0 in this case. An advantage of the asymptotic minimax solution, of
course, is that it does not require the statistician to elicit detailed prior information
from a client.

The following result shows that the asymptotic minimax solution (3) provides
an asymptotic lower bound for optimal Bayesian solutions for a very large class of
prior distributions.

Theorem 2. Suppose that ξ{0} = 0, that ξ has a continuously differentiable density
on (0, ∞), and that < ∞. Let m_a* = m_a*(ξ) be an optimal initial
sample size for ξ. Then

$$
\lim_{a \to \infty} \frac{m_a^*}{\sqrt{a}} = \infty. \tag{19}
$$

Proof. As above, there is no loss of generality in supposing that
By (13),

$$
\bar r_a(\xi) = \inf_{m \ge 2} \big[ 2a\,b_m + c_m(a) + \eta(a, m) \big],
$$

where b_m = E^ξ[√V_m − U_m], c_m(a) = E^ξ[(m − a√V_m)_+²]/m, and 0 ≤ η(a, m) ≤ 1/m.
Then

$$
2a\,\big[ b_{m_a^*} - b_{2m_a^*} \big] \le c_{2m_a^*}(a) + \frac{1}{2m_a^*},
$$

since 2a b_m + c_m(a) + η(a, m) is minimized when m = m_a* and 0 ≤ η(a, m) ≤ 1/m.
By Lemmas 1 and 2 below,

$$
c_m(a) \le m\, P^{\xi}\Big[ \sigma \le \frac{m}{a} \Big] \tag{20}
$$

and

$$
b_m - b_{2m} \ge \frac{\epsilon}{m} \tag{21}
$$

for some ε > 0 that does not depend on m. Combining the last three equations,

$$
\frac{2a\epsilon}{m_a^*} \le 2 m_a^*\, P^{\xi}\Big[ \sigma \le \frac{2 m_a^*}{a} \Big] + \frac{1}{2 m_a^*}
$$

and, therefore,

$$
\frac{m_a^*}{\sqrt{a}} \ge \sqrt{ \frac{\epsilon}{2 P^{\xi}[\sigma \le 2 m_a^*/a]} }
$$

for all sufficiently large a. Relation (19) follows directly, completing the proof, except
for the proofs of the lemmas. □


Lemma 1. Relation (20) holds. 
Proof. Using Jensen's Inequality twice, (m − a√V_m)_+² ≤ [E^m(m − aσ)_+]² ≤
E^m[(m − aσ)_+²]. So,

$$
c_m(a) \le \frac{1}{m} E^{\xi}\big[ (m - a\sigma)_+^2 \big] \le m\, P^{\xi}\Big[ \sigma \le \frac{m}{a} \Big],
$$

as asserted. □

Lemma 2. There is an ε > 0 for which relation (21) holds.


Proof. Since E^ξ(U_m) = E^ξ(σ) for all m, b_m − b_2m = E^ξ[√V_m − √V_2m]. Next, since

V_m − V_2m = 2√V_m(√V_m − √V_2m) − (√V_m − √V_2m)² and E^m(V_2m − V_m) = 0,

$$
b_m - b_{2m} = E^{\xi}\left[ \frac{(\sqrt{V_{2m}} - \sqrt{V_m})^2}{2\sqrt{V_m}} \right].
$$

From Laplace’s method, for example, [6],



w.p.1 (P_σ²) for each σ² > 0 and, therefore, w.p.1 (P^ξ). That √m(√V_2m − √V_m)
has a non-degenerate limiting distribution follows directly, and then

$$
\liminf_{m \to \infty}\; m\, E^{\xi}\left[ \frac{(\sqrt{V_{2m}} - \sqrt{V_m})^2}{2\sqrt{V_m}} \right] > 0
$$

by Fatou’s Lemma. Relation (21) follows.


5. Remarks and acknowledgments 

The smoothness condition on the prior in Theorem 2 can probably be relaxed. In
the proof, it was used to derive the relation Kn — = O(1/m) w.p.1, and this is
a smaller order of magnitude than is needed.

If ξ is an inverted gamma prior with a fixed α > 1 and β > 0, then

rr^ " 0[v/bi(^]. 

This may be established by combining techniques from the proofs of Theorems 1 
and 2. 

Bayesian solutions to two-stage sequential estimation problems have been
considered by several authors, notably [2, 3], and [10].

The normality assumption has been used heavily, to suggest the estimators for
μ and σ² and also for special properties of these estimators in (4), (9) and (10).
It is expected that similar results may be obtained for multiparameter exponential
families and other loss structures, and such extensions are currently under
investigation in the doctoral work of Joon Lee. Extensions to a non-parametric context
are more speculative.

It is a pleasure to acknowledge helpful conversations with Bob Keener, Joon
Sang Lee, and Anand Vidyashankar and helpful comments from Anirban Dasgupta.
This research was supported by the National Science Foundation and the U.S. Army
Research Office.
Estimating gradient trees

Ming-Yen Cheng, Peter Hall and John A. Hartigan

National Taiwan University, Australian National University and Yale University 

Abstract: With applications to cluster analysis in mind, we suggest new
approaches to constructing tree diagrams that describe associations among points
in a scatterplot. Our most basic tree diagram results in two data points being
associated with one another if and only if their respective curves of steepest
ascent up the density or intensity surface lead toward the same mode.
The representation, in the sample space, of the set of steepest ascent curves
corresponding to the data, is called the gradient tree. It has a regular,
octopus-like structure, and is consistently estimated by its analogue computed from a
nonparametric estimator which gives consistent estimation of both the density
surface and its derivatives. We also suggest ‘forests’, in which data are
linked by line segments which represent good approximations to portions of
the population gradient tree. A forest is closely related to a minimum spanning
tree, or MST, defined as the graph of minimum total length connecting
all sample points. However, forests use a larger bandwidth for constructing the
density-surface estimate than is implicit in the MST, with the result that they
are substantially more orderly and are more readily interpreted. The effective
bandwidth for the MST is so small that even the corresponding density-surface
estimate, let alone its derivatives, is inconsistent. As a result, relationships that
are suggested by the MST can change considerably if relatively small quantities
of data are added or removed. Our trees and forests do not suffer from
this problem. They are related to the concept of gradient traces, introduced
by Wegman, Carr and Luo (1993) and Wegman and Carr (1993) for purposes
quite different from our own.


1. Introduction 

Gradient trees capture topological features of multivariate probability densities,
such as modes and ridges. In this paper we suggest methods for estimating gradient
trees based on a sample of n observations from the density. Each estimator is in the
form of a tree with n − 1 linear links, connecting the observations. The methods
will be evaluated in terms of their accuracy in estimating the population gradient
tree, and their performance for real data sets. We also propose a new technique for
describing, and presenting information about, neighbour relationships for spatial
data.

To define a gradient tree, note that the gradient curves of a multivariate density
f are the curves of steepest ascent up the surface S defined by y = f(x). The
representations of gradient curves, in the sample space, will be called density
ascent lines, or DALs. The tree-like structure that they form is the gradient tree. This
theoretical quantity may be estimated by replacing f by a nonparametric density
estimator, f̂ say, and then following the prescription for computing DALs and the
gradient tree.
A gradient tree may be viewed as a modification of the concept of a ‘gradient trace’,
introduced by Wegman, Carr and Luo (1993) and Wegman and Carr (1993). The
goal of these authors was to use gradient traces to compute ‘k-skeletons’, which
are k-dimensional analogues of the mode and represent nonlinear regression-like
summary statistics. Our purpose is quite different. We view gradient trees as a tool
for cluster analysis, and argue that in this context the concept has advantages over
more familiar methodology such as minimum spanning trees, or MSTs, introduced
by Florek et al. (1951); see also Friedman and Rafsky (1981, 1983).

An MST is the graph of minimum total length connecting all sample points.
It is an estimator of the gradient tree that arises when we take f̂ to be the most
basic of nearest neighbour density estimators, in which the estimate at each point
is inversely proportional to a monotone function of the distance to the closest
sample point. However, this is a poor estimator of the population density, let alone
its gradient, and so it is not surprising that the MST is a poor estimator of the
corresponding population gradient tree. We suggest gradient tree estimators that
are asymptotically consistent for the corresponding population gradient tree, and
which also improve on the MST for small sample sizes.

We also suggest algorithms for drawing ‘forests’, using either the full dataset
or subsets that have been identified by the gradient tree. Like the MST, a forest
provides information about relationships among neighbouring data, but like our
gradient tree it has the advantage that it is based on relatively accurate, and
statistically consistent, information about gradients. In contrast with the MST, a forest
is based on directed line segments, with the direction corresponding to movement
up an estimate Ŝ of the surface S. Our approach to constructing a forest allows the
experimenter to choose, when describing relationships between points, how much
emphasis will be given to a relatively conventional Euclidean measure of closeness
of the points, and how much will be given to a measure of closeness related to
movement up Ŝ.

Although we work mainly in the bivariate case, our methods are certainly not
limited to two dimensions. One way of treating high-dimensional data is of course
to form bivariate scatterplots by projection, and apply our methods to the
individual plots. Tools for manipulating two-dimensional projections of three- or
higher-dimensional data include Asimov’s (1985) grand tour, Tierney’s (1990) Lisp-Stat, or
Swayne, Cook and Buja’s (1991) XGobi; see also Cook, Buja, Cabrera and Hurley's
(1995) grand-tour projection-pursuit.

Moreover, density ascent lines and gradient trees have analogues when the sample
space is of arbitrarily high dimension, rather than simply bivariate. (Analogues
of forests may be constructed too, but the formula for a certain penalty term that is
needed to define a forest is more complex in higher dimensions.) Hence, rather than
compute these quantities for bivariate scatterplots, their multivariate forms
(represented as lines in space, rather than lines in the plane) could be calculated and
then viewed through their bivariate projections, or through rotations of trivariate
projections.

Density-based approaches to assessing relationship have also been considered
by Hartigan (1975), who took clusters to be maximal connected sets (that
enjoyed at least a certain level of likelihood) of points of density exceeding a
certain level. See also the discussion of tree diagrams by Hartigan (1982).
Alternative approaches include methods based on measures of distance that satisfy the
triangle inequality (e.g. Jardine and Sibson, 1971; Hubert, 1974) and techniques
founded on parametric mixtures (e.g. Everitt, 1974; Kuiper and Fisher, 1975).
Wishart (1969) was an early user of near neighbour methods to construct clusters.
Pruzansky, Tversky and Carroll (1982) compared spatial and tree representations
of data.

2. Gradient trees and ridges 

We begin by defining a ‘true’ density ascent line, when the density f of the
bivariate distribution of a general data point X is assumed known. Then we discuss
computation of this line, and calculation of its sample version.

Let S be the surface defined by the equation y = f(x), and assume that both
the first derivatives of f are continuous everywhere. Suppose too that the set of
positive density is connected, and contains at most a finite number of stationary
points. A density ascent line (DAL) for f, starting at a point x in the plane Π that
denotes the sample space, is defined to be the projection, into Π, of the trajectory
formed by climbing S in the direction of steepest ascent. Henceforth we shall call
the ‘projection’ of a three-dimensional structure into Π, the ‘representation’ of that
structure in Π, and reserve the term ‘projection’ for other purposes.

If the trajectory on S is represented as the locus of points (x^(1)(s), x^(2)(s), y(s)),
where s ∈ (0, s_0) is a convenient parameter such as distance along the trajectory
from one of its ends, then the corresponding DAL will be the curve formed by the
locus of points (x^(1)(s), x^(2)(s)), for s ∈ (0, s_0), in Π. If f_1, f_2 denote the derivatives
of f in the two coordinate directions then the curve of steepest ascent is in the
direction (f_1, f_2), and is well defined except at stationary points of the density. The
gradient tree is the collection of closures of DALs.

Next we give more detail about a DAL, and then an explicit method for
computing one. Let D(f) = (f_1² + f_2²)^{1/2} denote the length of ∇f = (f_1, f_2), and put
ω_j = f_j/D(f) and ω = (ω_1, ω_2). Then, for x ∈ Π, ω(x) is the unit vector in Π
representing the direction of steepest ascent up S, at the point (x, f(x)) ∈ S. The DAL
that passes through x ∈ Π may be thought of as having been obtained, starting
at a point on the line, by stepping along the line in the direction indicated by ω.
Formally, the DAL that passes through x ∈ Π may be represented by the infinitesimal
transformation x ↦ x + ω(x) ds, where ds is an element of displacement along
the DAL, denoting the length of one of the aforementioned steps.

This suggests the following algorithm for computation. Given x_0 ∈ Π, and a
small positive number δ, consider the sequence of points V = {x_j : −∞ < j < ∞}
defined by x_j = x_{j−1} + ω(x_{j−1})δ and x_{−j} = x_{1−j} − ω(x_{1−j})δ, for j ≥ 1. Thus, the
DAL that passes through x_0 represents the limit, as δ → 0, of the sequence V. The
algorithm is convenient for numerical calculation, provided we stop before reaching
places where D(f) vanishes.
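
The ascending half of this stepping rule can be sketched in a few lines. The example below is only an illustration under stated assumptions: the two-bump density, its centres, the function names and the rule of stopping as soon as a step fails to increase the density are all hypothetical choices made here, not part of the paper.

```python
import math

# Illustrative (unnormalised) two-bump density; the centres are hypothetical.
CENTRES = ((-1.0, 0.0), (1.5, 0.5))


def density(x, y):
    """Sum of two unit-variance Gaussian bumps (assumed test surface)."""
    return sum(math.exp(-0.5 * ((x - cx) ** 2 + (y - cy) ** 2)) for cx, cy in CENTRES)


def gradient(x, y):
    """Partial derivatives (f_1, f_2) of the test surface."""
    gx = gy = 0.0
    for cx, cy in CENTRES:
        w = math.exp(-0.5 * ((x - cx) ** 2 + (y - cy) ** 2))
        gx -= (x - cx) * w
        gy -= (y - cy) * w
    return gx, gy


def density_ascent_line(x0, delta=0.01, max_steps=100000):
    """Follow x_j = x_{j-1} + omega(x_{j-1}) * delta uphill from x0."""
    path = [x0]
    x, y = x0
    for _ in range(max_steps):
        gx, gy = gradient(x, y)
        d = math.hypot(gx, gy)              # D(f), the length of the gradient
        if d == 0.0:                        # omega is undefined at a stationary point
            break
        nx, ny = x + delta * gx / d, y + delta * gy / d
        if density(nx, ny) <= density(x, y):
            break                           # assumed stopping rule: the step overshot the mode
        x, y = nx, ny
        path.append((x, y))
    return path
```

Calling `density_ascent_line((-2.0, 1.0))` returns a list of points spaced δ apart that ends close to the nearer mode; the descending half of the sequence would use the same step with the sign of ω reversed.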

In empirical work, where we compute estimators of DALs, we of course replace
f, f_1, f_2 in the algorithm by their estimators f̂, f̂_1, f̂_2. We used the algorithm
described above, with a suitably small value of δ, to calculate the empirical DALs
shown in Section 4. Alternatively, one could recognise that DALs are integral lines
of the gradient field of a smooth density function, implying that in principle they
could be computed using an ordinary differential equation solver.

There is no commonly accepted definition of a ridge (or antiridge) of a surface
such as S, and in fact four different approaches, framed in terms of indices of
‘ridgeness’, were suggested by Hall, Qian and Titterington (1992). The following
is related to the second definition there, and is chosen partly for ease of
computation in the present context; its representation in Π is easily calculated from the
functional D(f). Moreover, the representation is itself a DAL, and it admits an
elementary (and computable) generalisation to high-dimensional settings.
Given a point P on S, let Π′ = Π′(P) denote the plane that contains P and is
parallel to Π, and let C be the curve formed by the intersection of Π′ with S. If the
steepest ascent curve up S, starting from P, is perpendicular to C at P, then we say
that P is a point on a ridge (or an antiridge) of S. The ridge or antiridge itself is a
locus of such points, and is the curve of steepest ascent on S that passes through P.
(Therefore, its representation in Π is a DAL.) The point P is on a ridge, rather
than an antiridge, if the curvature at P of the curve formed by the intersection of S
with a plane perpendicular to Π, and containing P, is negative; and on an antiridge
if the curvature is positive.

A ridge can bifurcate at a point which represents a location on S where three
or more ridges join. The trajectories of steepest ascent that climb up the surface
between two ridges meeting at a bifurcation point B, necessarily join one another
at B. From there they have a common path, along an ascending ridge that leads
away from B; and they continue together until they terminate at a local maximum,
perhaps passing through other bifurcation points on the way.

The representation, in the plane Π, of a ridge and a bifurcation point will be
called a ridge line (RL) and a branchpoint, respectively. The DALs corresponding
to the representations (in Π) of ridges have different paths until they meet their
first branchpoint, after which they are the same until they terminate at a mode. An
RL is essentially what Wegman and Carr (1993) call a 1-skeleton, the main difference
being in the definition of a ridge.

Therefore, the DALs that comprise a gradient tree do have a tree-like structure,
in the following way. Individual points in the sample space, representing leaves of
the tree, are at first linked to branchpoints through distinct DAL paths. Beyond
the first branchpoint the consolidated bundle of DAL paths, representing a branch
of the tree, may be joined at subsequent branchpoints by other branches, until they
finally reach a mode.

In theory, more complex structures are also possible, for example when two
branches lead away from a branchpoint and come together again at a mode or at
another branchpoint. However, it is rare in practice for such features to occur in
DALs computed from data via nonparametric density estimators, and so we shall
not consider them further here.

Two points x_1, x_2 ∈ Π that are linked to the same mode by a DAL, may be
said to lie in the same cluster. Thus, DALs divide the plane into clusters. Ridge
lines divide the sample space in a different manner, in a sense orthogonal
to the division into clusters. They give neither a subclassification nor
a higher-level classification, but provide information of a different type, as
follows.

If the ridge that produced an RL were almost horizontal, and lay between two
local maxima of S, occurring at points x_max,1 and x_max,2, say, in Π, then the points
along that RL would have no clear allocation to the clusters corresponding to x_max,1
and x_max,2. Therefore, the RL would represent a watershed in the division of the
sample space into clusters. On the other hand, a point that lay on either side of,
and sufficiently close to, the RL would be more definitively allocated to just one of
the clusters represented by x_max,1 and x_max,2.

More generally, we might fairly say that points that lie on one side or other of an
RL are less ambiguously associated with their corresponding mode, at least if they
are sufficiently close to the RL, than are points that lie directly on the RL. Indeed,
if two points x_1, x_2 ∈ Π lie on opposite sides of, and sufficiently close to, an RL,
then all points x_3 that lie between x_1 and x_2 can be said to be more ambiguously
associated with their corresponding modes than either x_1 or x_2.
In addition to their role in defining such a gradation of the sample space, the
fact that RLs of density or intensity estimators represent the ‘backbone’ and ‘ribs’
of the structure of those quantities means that they provide valuable quantitative
information about structure. Indeed, they are sometimes used to approximate the
locations of physical structures associated with scatterplots, for example positions
of the subterranean fault lines that give rise to earthquake epicentres (see Jones
and Stewart, 1997).

Relative to ridge lines, antiridge lines have more connection with clustering in
the usual sense, since they represent boundaries between regions where points are
assigned to different clusters. However, they are typically computed from relatively
little data, and so their locations may not be known as precisely as those of ridge
lines.

Next we describe a method for locating, and computing, an RL, given the
density f. A locus of points on S, all of which have the same height above Π, is called a
level set of S. Its representation in Π is a contour of S. An RL may be reached from
another point in Π by moving around a contour. The orientation of the contour
passing through x is the direction of the unit vector ω_perp(x), say, defined as being
orthogonal to ω(x) and determined up to a change of sign. Therefore, the contour
is defined by the infinitesimal transformation x ↦ x ± ω_perp(x) ds, where ds is an
infinitesimal unit of length around the contour. The point at which this contour
cuts an RL is a local minimum of D(f); a local maximum corresponds to cutting
the representation in Π of an antiridge.

Hence, to find a point x on an RL we move around the contour, computing D(f)
as we go, until we find a local minimum of D(f). Then, moving along the RL is
equivalent to moving up the DAL starting from x, or down the DAL leading to x;
we have already described how this may be done. It is helpful to note that turning
points of D(f) are solutions of the equation

f_12 (f_1² − f_2²) = f_1 f_2 (f_11 − f_22),

where f_ij(x) = ∂²f(x_1, x_2)/∂x_i ∂x_j. Of course, descending the DAL that defines a
ridge is equivalent to traversing the line defined by x ↦ x − ω(x) ds, where now ds
is an infinitesimal unit of length along the DAL.
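
The turning-point equation for D(f) can be checked by differentiating D(f)² along the contour. The derivation below is a sketch that uses only the definitions of this section; the sign convention chosen for ω_perp is an assumption, since the text fixes that vector only up to a change of sign.

```latex
\begin{align*}
D(f)^2 &= f_1^2 + f_2^2, \qquad
\omega_{\perp} = \frac{(-f_2,\; f_1)}{D(f)}, \\
\tfrac12\, \omega_{\perp} \cdot \nabla \bigl\{ D(f)^2 \bigr\}
  &= \frac{1}{D(f)} \bigl\{ -f_2\,(f_1 f_{11} + f_2 f_{12}) + f_1\,(f_1 f_{12} + f_2 f_{22}) \bigr\} \\
  &= \frac{1}{D(f)} \bigl\{ f_{12}\,(f_1^2 - f_2^2) - f_1 f_2\,(f_{11} - f_{22}) \bigr\}.
\end{align*}
```

Away from stationary points D(f) > 0, so the derivative of D(f) along the contour vanishes exactly when f_12 (f_1² − f_2²) = f_1 f_2 (f_11 − f_22).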

More generally, if the sample space Π is p-dimensional, where p ≥ 2; and if
we define D(f) = (Σ_i f_i²)^{1/2}, where f_i equals the derivative of f in the
ith coordinate direction, for 1 ≤ i ≤ p; then a ridge line or antiridge line
is a locus in Π of turning points of D(f). It may be calculated by generalising
the method suggested above. A practicable, computational algorithm
for an RL may be obtained as before, replacing the infinitesimal ds by a small
positive number δ. The empirical version, in which density f is replaced by the
density estimator f̂, also follows as before; we used this method to compute the
RLs shown in Section 4. Tests for significance of empirical modes may be based
on work of Silverman (1981), Hartigan and Hartigan (1985), Müller and Sawitzki
(1991) or Cheng and Hall (1999), for example.

3. Forests based on distance and density 

While the minimum spanning tree is not consistent for the population gradient
tree, it provides some information about relationships among neighbouring data
values. In this section we suggest a regularisation of the minimum spanning tree
in which links between observations are penalised if they are not sufficiently close
to estimated density ascent lines. It may be applied to a subset 𝒴 = {Y_1, ..., Y_N}
of the sample 𝒳 = {X_1, ..., X_n}, for example to those data that are linked to the
same mode in the gradient tree, as well as to the full sample.

Let ‖Y_i − Y_j‖ denote Euclidean distance in the sample space Π, and let d(Y_i, Y_j)
be some other measure of distance between Y_i and Y_j. It is not necessary that
d(·,·) be a metric; appropriate definitions of d are powers of Euclidean distance in
Π, i.e. d(Y_i, Y_j) = ‖Y_i − Y_j‖^s, and powers of Euclidean distance on Ŝ, i.e.

$$
d(Y_i, Y_j) = \big[ \|Y_i - Y_j\|^2 + \{\hat f(Y_i) - \hat f(Y_j)\}^2 \big]^{s/2},
$$

where s > 0. In our numerical work in Section 4 we shall use the first of these
definitions, with s = 2.

Now add a penalty to d(Y_i, Y_j), proportional to the squared length of the
projection of Y_i − Y_j orthogonal to ω̂(Y_i). (Here, ω̂(x) denotes the empirical form of
ω(x), computed with f̂ replacing f.) Equivalently, the penalty is proportional to
the area of the triangle that has one side equal to the length of the line joining Y_i
and Y_j, and another equal to the length of the representation in Π of a straight-line
approximation, of the same length as the previous side, to the gradient curve. The
area in question equals half the value of ‖Y_i − Y_j‖² − {(Y_i − Y_j) · ω̂(Y_i)}² if the
vertex of the triangle is at Y_i. We apply these penalties in proportion to a tuning
parameter t ≥ 0, obtaining asymmetrically and symmetrically penalised versions,
respectively, of d(Y_i, Y_j):

$$
D(Y_i, Y_j) = d(Y_i, Y_j) + t \big[ \|Y_i - Y_j\|^2 - \{(Y_i - Y_j) \cdot \hat\omega(Y_i)\}^2 \big] \quad \text{or} \tag{3.1}
$$

$$
\begin{aligned}
D(Y_i, Y_j) = d(Y_i, Y_j) &+ t \big[ \|Y_i - Y_j\|^2 - \{(Y_i - Y_j) \cdot \hat\omega(Y_i)\}^2 \big] \\
&+ t \big[ \|Y_i - Y_j\|^2 - \{(Y_i - Y_j) \cdot \hat\omega(Y_j)\}^2 \big].
\end{aligned} \tag{3.2}
$$

Using a large value of t amounts to placing more emphasis on point pairs whose
interconnecting line segment lies close to a gradient curve.

We are now in a position to construct the forest corresponding to the dataset 𝒴
and the penalised distance measure D. Given Y_i, we draw a directed line segment
from Y_i to Y_j if and only if Y_j minimises D(Y_i, Y_j) over all points Y_j for which
f̂(Y_j) > f̂(Y_i). The forest is the set of these directed segments. If 𝒴 is a cluster,
and if we adjoin to 𝒴 the unique mode associated with that structure, then with
probability 1 there is exactly one point Y_i (the mode) in 𝒴 for which the directed
line segment does not exist. As we climb higher up the surface the directed line
segments tend to coalesce, producing a tree structure sprouting from the mode
(although it was constructed from the opposite direction).
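
The construction of a forest can be sketched as follows, using the one-sided penalty with d equal to a power of Euclidean distance. This is a minimal illustration rather than the authors' implementation: the function names and argument layout are hypothetical, and the density values and unit gradient directions at the data points are assumed to have been computed beforehand by some estimator.

```python
import math


def penalised_distance(yi, yj, omega_i, t, s=2.0):
    """d(yi, yj) = ||yi - yj||^s plus t times the squared length of the
    component of yi - yj orthogonal to the unit vector omega_i."""
    dx, dy = yi[0] - yj[0], yi[1] - yj[1]
    sq = dx * dx + dy * dy                      # ||yi - yj||^2
    along = dx * omega_i[0] + dy * omega_i[1]   # (yi - yj) . omega_i
    return sq ** (s / 2.0) + t * (sq - along * along)


def build_forest(points, density, omega, t):
    """Map each index i to the index j of the point of higher estimated density
    that minimises the penalised distance, or to None if no such point exists."""
    links = {}
    for i, yi in enumerate(points):
        best, best_cost = None, math.inf
        for j, yj in enumerate(points):
            if density[j] > density[i]:
                cost = penalised_distance(yi, yj, omega[i], t)
                if cost < best_cost:
                    best, best_cost = j, cost
        links[i] = best
    return links
```

With t = 0 each point is simply linked to its nearest neighbour of higher estimated density; increasing t favours neighbours lying along the estimated gradient direction. The double loop costs O(N²) operations, which is adequate for samples of a few hundred points.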

If we define D(·,·) as at (3.1) then taking t = 0 produces a forest that is similar
in both definition and appearance to the minimum spanning tree, although based
on directed line segments. Choosing a relatively large value of t imposes greater
penalty for not walking as nearly as possible along the DAL that starts at Y_i,
when passing from Y_i to Y_j. The extent to which line segments cross over in the
forest may be reduced by increasing t, thereby forcing the direction of movement
on S to give more emphasis to the uphill component of motion. The advantage
of (3.2) over (3.1) is that in the former the tree treats the notions of ‘uphill’ and
‘downhill’ symmetrically, but in practice, forests defined by (3.1) and (3.2) are
virtually identical.

4. Numerical examples 

Rees (1993) determined the ‘proper motions’ of 515 stars in the region of the
globular cluster M5. Using the proper motions and radial velocity dispersions he
estimated the probability that each star belonged to the cluster. The analysis below is
based on the Hertzsprung-Russell diagram, a plot of magnitude versus temperature,
for the 463 stars that were determined by Rees to have probability of at least 0.99
of belonging to the cluster.

Figure 1: Steepest Ascent Trees. Panels (a), (b) and (c) depict DALs for the
smoothed nearest neighbour estimator corresponding to k = 25, 50, 100,
respectively.

We employed two different versions of f̂. Both were nearest neighbour methods,
which we chose for reasons that were both pragmatic (the adaptivity of NN methods
means that they have less tendency than other density estimation techniques to
suffer from spurious islands of mass) and didactic (NN methods are commonly
used in classification problems). The first version of f̂ was a standard kth nearest
neighbour estimator, with f̂(x) equal to k/(nπr²) where r = r(x) was the smallest
number such that the circle centred on x and with radius r contained just k points.
The second density estimator was a smoothed version of the first, equal to 2k/(nπr²)
where r was the solution of




See Section 5 for discussion of this technique. Since our graphs remain unchanged
if we multiply f̂ by a constant factor then it is not necessary to normalise, and so
the factor k/nπ may be dropped.
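
The first, unsmoothed estimator is simple enough to sketch directly. The function below is an illustration with a hypothetical name and interface; it covers only the standard kth nearest neighbour form k/(nπr²) and does not attempt the smoothed version.

```python
import math


def knn_density(x, sample, k):
    """Standard k-th nearest neighbour density estimate in the plane.

    r is the distance from x to its k-th closest sample point, so that the
    circle of radius r centred on x contains k points. If x is itself a
    sample point it counts as its own nearest neighbour (distance zero),
    so k = 1 at a sample point gives r = 0 and is not supported here.
    """
    dists = sorted(math.dist(x, p) for p in sample)
    r = dists[k - 1]
    return k / (len(sample) * math.pi * r ** 2)
```

Because only the ordering of density values and the direction of the gradient matter for trees and forests, the constant factor k/(nπ) could be omitted in such a function without changing the resulting graphs.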

Figure 1 depicts the gradient tree, or collection of DALs, for k = 25, 50, 100. In
constructing figures 1 and 2 we used only the second, smoothed nearest neighbour
estimator f̂. Note that as k increases the number of empirical modes decreases; the
number is 7, 4, 2 for k = 25, 50, 100 respectively. The gradient trees indicate which
points are most closely associated with the respective modes. The orientations and
spacings of the tentacles of these ‘octopus diagrams’ provide information about the
steepness of f̂ in different places.

Figure 2 shows the RLs for the same values of k. Ridge lines are depicted by 
solid lines, and antiridge lines by dashed lines. The main RL, in the lower right of 
the figure, is clearly depicted; it is in a sense the backbone of the surface defined by 
the density estimator. Other RLs represent relatively minor ‘creases’ in the surface, 
and play more the role of ‘ribs’. 

The gradient trees provide only minimal information about interpoint
relationships. Detail of that type is more readily supplied by forests, depicted in figures 3
and 4 for the two respective density estimators. We used the distance function
defined at (3.1), with d(Y_i, Y_j) = ‖Y_i − Y_j‖². The six panels in each figure represent
different pairs of values of the smoothing parameter k = 25, 50, 100 and gradient
weight t = 0, 10. Taking t = 0 produces directed line segments based almost entirely
on distances between points, except that the direction of the segment is always that
of increasing estimated density. The resulting forest is comparable to the minimum
spanning tree, and its links have almost random orientation. On the other hand,
using t = 10 gives heavy weight to segments that lie close to the representation
in Π of the estimated gradient curve, and (for both density estimators) produces a
more orderly presentation of the links.

Figure 2: Ridge Projections. Panels (a), (b) and (c) show the ridge lines (solid)
and antiridge lines (dashed) corresponding to the respective DALs in figure 1. To
illustrate relationships to the data, a scatterplot of the data is included in each
panel.

Figure 3: Forests. Forests drawn using the unsmoothed nearest neighbour estimator,
with t = 0 (top row) and t = 10 (bottom row), and k = 25, 50, 100 (columns 1-3).

Figure 4: Forests. Forests drawn using the smoothed nearest neighbour estimator,
with panels ordered as in figure 3.

Overall, the data show strong evidence of a northwest to southeast ridge, and at
least three modes. Smoothing the density estimator produces some regularisation
of forests, but choice of k has much greater effect on our graphs than estimator
type.

In order to further illustrate performance of the gradient tree approach, these
methods, along with two conventional graphical tools (contour plots and perspective
mesh plots), were applied to two simulated data sets. In these examples, which are
discussed below, smoothed nearest neighbour estimators were employed whenever
estimation of the density and its gradients were required.

In the first example, 500 random variates were generated from the bimodal
Normal mixture.

The smoothing parameter was k = 45, and gradient weight was t = 10. The data,
contour plots, and perspective mesh plots based on the density estimator, are shown
in panels (a) and (b) of figure 5, which provide evidence of bimodality. However,
the density ascent lines, ridge lines and forests, depicted in panels (c), (d) and (e)
of figure 5, show more clearly than panels (a) and (b) structure of the surface, and
in particular the locations of the two modes and the steepest ascent directions up
the surface.

Figure 5: Bimodal data example. A scatterplot of 500 random numbers simulated
from model (4.1) is shown in panel (a). Panels (a), (b), (c), (d) and (e) depict
respectively contour plots, a perspective mesh plot, density ascent lines, ridge lines,
and forests based on the smoothed nearest neighbour estimator with k = 45 and
t = 10.

Each of the graphical methods illustrated in panels (c) and (d) divides the 500 
data points into two subgroups, in which each point is connected to the centre 
of the subgroup to which it belongs. The directions of the density ascent curves, 
and hence information about the way in which the surface increases as one moves 
in different directions, are conveyed much better by these two graphics than by 
those in panels (a) and (b). Most importantly, panels (c) and (d) allow the reader 
to extract point-to-point relationships from the data to a significant extent; such 
information cannot be so readily obtained from the contour plot (panel (a)) or the 
perspective mesh plot (panel (b)). 

The second example is of data simulated from a model, described below, which 
has more complex structure than that described at (4.1). Let U, V, W, Z be 
independent random variables, with U and V having the $N(0, 0.06^2)$ distribution, W being 
uniformly distributed on the interval (-1, 1), and Z having density $g(z) = 0.2z + 0.5$ 
for $|z| < 1$. Put 

$$X = \mathrm{sgn}(W)\,(0.6 - Z)\,I(-1 < Z < 0.6) + U, \qquad Y = Z + V, \qquad (4.2)$$

where I(·) denotes the indicator function. The surface defined by the joint density 
of (X, Y) has two ridges, represented by the lines x = ±(0.6 - y) for -1 < y < 0.6, 
which merge at (0, 0.6) and then continue together along the line x = 0 until the 
point (0, 1) is reached. The height of the surface increases steadily as one travels 
along any of these ridges in a direction that has a northbound component. 

Figure 6: Ridge data example. A scatterplot of 500 observations simulated from 
model (4.2), and the corresponding contour plots, are shown in panel (a). Perspective 
mesh plots from different angles, showing the two ridge branches, are given 
in panels (b) and (c). Panels (d), (e) and (f) depict respectively density ascent 
lines, ridge lines, and forests. Graphics here used the smoothed nearest neighbour 
estimator with k = 55 and t = 10. 
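A minimal sketch of how data from model (4.2) might be simulated is given below. It assumes the reading of the model in which U and V have standard deviation 0.06 and Z has density g(z) = 0.2z + 0.5 on (-1, 1); Z is drawn by inverting its distribution function G(z) = 0.1z^2 + 0.5z + 0.4. The function name `simulate_ridge_model` and the use of NumPy are illustrative choices, not part of the original study.

```python
import numpy as np


def simulate_ridge_model(n=500, seed=0):
    """Draw n points (X, Y) from model (4.2); also return the latent Z.

    Assumptions (illustrative): U, V ~ N(0, 0.06^2), W ~ Uniform(-1, 1),
    and Z has density g(z) = 0.2 z + 0.5 on (-1, 1).
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, 0.06, size=n)
    v = rng.normal(0.0, 0.06, size=n)
    w = rng.uniform(-1.0, 1.0, size=n)

    # Inverse-CDF sampling: solve 0.1 z^2 + 0.5 z + 0.4 = p for z in [-1, 1].
    p = rng.uniform(0.0, 1.0, size=n)
    z = (-0.5 + np.sqrt(0.09 + 0.4 * p)) / 0.2

    # Two ridge branches x = +/-(0.6 - z) below z = 0.6, a single ridge above.
    on_branch = (z > -1.0) & (z < 0.6)
    x = np.sign(w) * (0.6 - z) * on_branch + u
    y = z + v
    return x, y, z
```

Plotting `x` against `y` should show the two branches merging near (0, 0.6), with the point density increasing towards the north because g is increasing.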

We generated 500 observations from the model at (4.2). The smoothing parameter 
was taken to be k = 55, and the gradient weight was t = 10. Panel (a) of figure 6 
incorporates a scatterplot of the dataset. The contour plots and perspective mesh 
plots, given in panels (a)-(c) of figure 6, provide only a vague impression of the 
bi-ridge nature of the data. In contrast, the density ascent lines, ridge and antiridge 
lines and forests, shown in panels (d)-(f) of figure 6, provide substantially less 
ambiguous information about the ridges and, more generally, about the nature of 
the scatterplot. 

The tree and forest structures in different datasets, for example those in our 
last two examples, are readily compared. In particular the very different characters 
of the ‘octopus plots’ (tree structures made up of density ascent lines) in panel (c) 
of figure 5, and panel (d) of figure 6, are immediately apparent. The first shows 
two approximately symmetric clusters about single centres, with little evidence of 
ridges, while the second demonstrates marked asymmetry and ‘ridginess’. Likewise, 
the forests in panel (d) of figure 5, and panel (f) of figure 6, show very different 
hierarchical structures. The first demonstrates a relatively low level of relationship 
among different points in the cluster, with many of the branches of the forest joining 
the cluster relatively close to the respective mode, and so being related to other 
branches (and hence other points in the cluster) largely through that mode. On 
the other hand, panel (f) of figure 6 shows a strong degree of hierarchy, with each 
branch of the forest joining its respective ‘ridge branch’ after travelling only a short 
distance, and being linked to other branches through the ridge. 

5. Density estimators and theory 

The two-dimensional nearest-neighbour density estimators used in Section 4 may 
be described as follows. Given a kernel K, put $\hat f(x) = \hat f(x \mid R) = R/(n h_x^2)$, where $h_x$ 
is given by 


If K is the uniform kernel, equal to $1/\pi$ within a region R and 0 elsewhere, then 
this prescription requires $h_x$ to be such that R data values are contained within 
the region $x \oplus h_x R$, which of course is the standard near-neighbour construction. 
A disadvantage of the uniform kernel, however, is that the resulting estimator is 
very rough. The second approach discussed in Section 4 uses a bivariate form of 
the Epanechnikov kernel. Alternatively we could use bivariate biweight or triweight 
kernels. 

We employed the same value of R for all i, so that the bandwidth $h_x$ was 
relatively small in regions of high data density. Assuming that $R = R(n) \to \infty$ 
and $R/n \to 0$ as $n \to \infty$, it may be shown that $h_x \sim \{R/(n \kappa_1 f(x))\}^{1/2}$ as $n \to \infty$, 
where $\kappa_j = \int K(v)^j \, dv$. In particular, the effective bandwidth is of size $(R/n)^{1/2}$. 
Assuming that K is symmetric and f has two bounded derivatives, the bias and 
variance of $\hat f$ are of sizes $R/n$ and $R^{-1}$, respectively. Therefore, optimal mean-square 
performance of the estimator $\hat f$ is obtained with R of size $n^{2/3}$, in which case 

mean squared error equals $O(n^{-2/3})$, just as it would be for a traditional second-order 
kernel estimator. Variance is asymptotic to {nf^K^/ 

Note particularly that, using bandwidths of these sizes, our gradient estimators 
are consistent for the true gradients. That is not true for the implicit gradient 
estimators employed in a minimum spanning tree, which are in effect based on a 
bandwidth that is of size This means that the error-about-the-mean term in 

the estimator of f, let alone for estimators of the derivatives of f, does not converge 
to zero, which accounts for the haphazard, complex structure of minimum spanning 
tree diagrams. 

Abstract: With very large sample sizes the conventional calculations for tests 
of the equality of two probabilities can lead to very small P-values. In those 
cases, the large deviation effects make inappropriate the asymptotic normality 
approximations on which those calculations are based. While reasonable 
interpretations of the data would tend to reject the hypothesis in those cases, 
it is desirable to have conservative estimates which don’t underestimate the 
P-value. The calculation of such estimates is presented here. 

1. Introduction 

There are several excellent alternatives for testing the hypothesis that $p_1 = p_2$, 
where $p_1$ and $p_2$ are probabilities governing two binomial samples. These include 
the Yates continuity correction and the Fisher Exact test and several others based 
on the asymptotic normality of the observed proportions. All these test procedures 
have the desirable property that the calculated P-value does not depend on the 
unknown common probability under the hypothesis. There is a slight problem with 
the Fisher exact test, i.e., it is not strictly appropriate for the problem because the 
calculated probability is conditional on the values of the margins, which are not 
fixed in advance. The problem is considered slight because the information in the 
margins is quite small (Chernoff (2004)). 

In a legal case the problem arose where there were 7 successes out of 16 trials 
for one sample and 24 successes out of 246 in the second sample. It is clear that 
the hypothesis is not plausible in the light of these data. Since the various alterna- 
tive tests provide substantially different calculated P-values, all very small, it was 
considered wise to present a very conservative P-value. While one sample size was 
substantial, the other was quite modest. Neither was so huge that modern com- 
puters would be frustrated by calculating the exact P-value rather than relying on 
asymptotic theory. One consequence of such an approach is that the P-value is no 
longer independent of the unknown value of the nuisance parameter, the common 
value of the probabilities under the hypothesis. This problem is dealt with in several 
publications (Berger and Boos (1994), Chernoff (2003)). A crucial aspect of the dif- 
ficulty in using asymptotic theory is that in extreme cases where the P-values are 
very very small, we are in the tails of the distribution and asymptotic normality no 
longer fits in these large deviation cases. 

A new problem recently came to my attention, where both sample sizes are 
enormous, i.e. $n_1 = 19,479$ and $n_2 = 285,422$. Here again there are several cases 
where we have a large deviation problem, and asymptotic normality is not 
appropriate, and probably not conservative. How should we deal with this problem in 
this example where ordinary high speed computers may find it difficult to provide 
exact calculations such as were feasible in the previous case? The Chernoff bound, 
originally derived by H. Rubin, provides a method of deriving an upper bound on 
the desired probability which is convenient to calculate. 

2. The Poisson approximation 

While the normal approximation is unreliable, the Poisson approximation may be 
better. In any case, it is to be used here merely to provide an initial approximation 
for the quantities required for the binomial calculation. We outline the analysis 
which provides a solution assuming the Poisson approximation fits. 

The main tool to deliver a conservative bound on the P-value is the Chernoff 
bound, first derived by Herman Rubin, using a Chebyshev type of inequality, that 
states that if d > E(X), 

$$P(X \ge d) \le E\left(e^{t(X-d)}\right)$$

for all $t \ge 0$. The right hand side attains its minimum for t > 0. 

Let $X_1$ and $X_2$ be the numbers of successes in $n_1$ and $n_2$ independent trials with 
common probability p, and let 

$$D = \frac{X_1}{n_1} - \frac{X_2}{n_2}.$$

Using the Poisson approximation to the binomial distribution, we shall derive 
the curve in the (p, d) space for which the bound on $\log(P(D \ge d))$, 

$$q = \log\left(\inf_t E\left(e^{t(D-d)}\right)\right),$$

attains a given value, for d > 0. Under the assumption that the number of successes 
in each trial has a Poisson distribution, we have 

$$Q(t,d) = \log\left(E\left(e^{t(D-d)}\right)\right) = -dt + n_1 p\left(e^{t/n_1} - 1\right) + n_2 p\left(e^{-t/n_2} - 1\right).$$

Differentiating with respect to t, the value of t which minimizes Q satisfies 

$$e^{t/n_1} - e^{-t/n_2} = d/p = a,$$

while 

$$Q(t,d) = p\, r(t,a),$$

where 

$$r(t,a) = -at + n_1\left(e^{t/n_1} - 1\right) + n_2\left(e^{-t/n_2} - 1\right).$$

For each value of t, there is a corresponding value of a for which t is optimal 
and a corresponding value of r. Let p = q/r and d = ap. As t varies these values of 
p and d trace out the (p, d) curve corresponding to the given value of $q \ge \log(P)$. 
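The construction just described can be sketched in a few lines. The code below is only an illustration under the Poisson approximation: for a chosen t > 0 and target q < 0 it computes a, r, and then the point (p, d) = (q/r, ap). The function names `poisson_curve_point` and `q_poisson` are hypothetical, introduced here for the example.

```python
import math


def q_poisson(t, d, p, n1, n2):
    """Q(t, d) = -d t + n1 p (e^{t/n1} - 1) + n2 p (e^{-t/n2} - 1)."""
    return -d * t + n1 * p * math.expm1(t / n1) + n2 * p * math.expm1(-t / n2)


def poisson_curve_point(t, q, n1, n2):
    """Return the point (p, d) on the curve of level q for which t is optimal.

    Assumes t > 0 and q < 0, so that r(t, a) < 0 and p = q / r > 0.
    """
    a = math.exp(t / n1) - math.exp(-t / n2)            # a = d / p
    r = -a * t + n1 * math.expm1(t / n1) + n2 * math.expm1(-t / n2)
    p = q / r
    d = a * p
    return p, d


# Trace part of the curve for a bound of 1e-5 with the sample sizes of Section 1.
curve = [poisson_curve_point(t, math.log(1e-5), 19479, 285422)
         for t in (500.0, 1000.0, 2000.0, 4000.0)]
```

Each returned pair satisfies Q(t, d) = q at the minimizing t; values of p outside (0, 1) simply indicate that the chosen t lies outside the useful part of the curve.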

3. The binomial case 

We use the Poisson calculation to get a first approximation in the derivation of the 
(p, d) curves for the binomial case. In the previous section we obtained values of p 
and d for each value of t. Here we will keep both p and q fixed, and starting with 
the value of t, we find 

$$Q(t,d) = \log E\left(e^{t(D-d)}\right) = -td + n_1 \log\left(1 - p + p e^{t/n_1}\right) + n_2 \log\left(1 - p + p e^{-t/n_2}\right)$$

and the value of d for which Q is minimized by the given value of t is given by 

$$d(t) = \frac{p e^{t/n_1}}{1 - p + p e^{t/n_1}} - \frac{p e^{-t/n_2}}{1 - p + p e^{-t/n_2}}.$$

We note that 

$$d'(t) = p(1-p)\left\{\frac{e^{t/n_1}}{n_1\left(1 - p + p e^{t/n_1}\right)^2} + \frac{e^{-t/n_2}}{n_2\left(1 - p + p e^{-t/n_2}\right)^2}\right\}.$$


Insofar as Q(t, d(t)) varies from the specified value of q, we apply the Newton 
iteration to modify t. This leads t to the new value t + (q - Q(t, d(t)))/Q'(t), where 

$$Q'(t) = \partial Q/\partial t + d'(t)\, \partial Q/\partial d = -t\, d'(t).$$

Thus t goes into t - (q - Q)/(t d'(t)). 

If the new value of t and d(t) do not provide Q(t, d(t)) close enough to the desired 
value q, one may iterate again. Finally we have for each initial value of t and the 
given value of q a new point (p, d) for the curve of specified $q \ge \log(P(D \ge d))$. 

While the curves we have obtained of (p, d) values for a given value of q are 
useful, they don’t resolve the inverse problem in which we may be interested. That 
is, how do we calculate a bound on the P-value for a given p and d? A series of 
curves provided above would be useful to get rough approximations for a set of cases 
with given $n_1$ and $n_2$, but do not provide a reasonably precise algorithm should that 
be desired. To obtain the bound on the P-value, we start with the estimate of p 
given by $\hat p = (X_1 + X_2)/(n_1 + n_2)$. Assuming that value is fixed, we approximate 
t, assuming t is small compared to $n_1$ and $n_2$, by 


This value of t together with the observed value of D yields Q(t, D) and d(t). 
Insofar as d(t) differs from D, we modify t by the Newton method to t + (D - 
d(t))/d'(t). With this new value of t, we recalculate Q and d(t) and iterate until 
d(t) is approximately D. Then the bound on the P-value is given by $e^{Q(t,D)}$, assuming 
our estimate of p is accurate. Since the range of possible values of p is quite limited 
under the hypothesis, we can see how much the P-value changes by considering 
potential alternative values of p. 
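The iteration for the inverse problem can be sketched as follows. This is a hedged illustration rather than the author's program: d(t) and d'(t) are obtained by differentiating the binomial Q(t, d) of this section, and the starting value of t is an assumption, namely the small-t linearisation d(t) ≈ t d'(0). All function names are hypothetical.

```python
import math


def d_of_t(t, p, n1, n2):
    """Value of d for which the given t minimizes the binomial Q(t, d)."""
    e1, e2 = math.exp(t / n1), math.exp(-t / n2)
    return p * e1 / (1 - p + p * e1) - p * e2 / (1 - p + p * e2)


def d_prime(t, p, n1, n2):
    """Derivative of d(t) with respect to t."""
    e1, e2 = math.exp(t / n1), math.exp(-t / n2)
    return p * (1 - p) * (e1 / (n1 * (1 - p + p * e1) ** 2)
                          + e2 / (n2 * (1 - p + p * e2) ** 2))


def q_binomial(t, d, p, n1, n2):
    """Q(t, d) = log E exp{t (D - d)} for two independent binomial samples."""
    return (-t * d
            + n1 * math.log(1 - p + p * math.exp(t / n1))
            + n2 * math.log(1 - p + p * math.exp(-t / n2)))


def chernoff_log_bound(x1, n1, x2, n2, tol=1e-12, max_iter=100):
    """Return (log of the Chernoff bound on P(D >= observed D), optimal t).

    Assumes the observed D is positive and p is fixed at the pooled estimate.
    """
    big_d = x1 / n1 - x2 / n2
    p = (x1 + x2) / (n1 + n2)
    # Assumed starting value: solve the linearisation d(t) ~ t * d'(0) = D.
    t = big_d * n1 * n2 / ((n1 + n2) * p * (1 - p))
    for _ in range(max_iter):
        step = (big_d - d_of_t(t, p, n1, n2)) / d_prime(t, p, n1, n2)
        t += step                      # Newton update t + (D - d(t)) / d'(t)
        if abs(step) <= tol * max(1.0, abs(t)):
            break
    return q_binomial(t, big_d, p, n1, n2), t


# The legal case of Section 1: 7 successes in 16 trials versus 24 in 246.
log_bound, t_opt = chernoff_log_bound(7, 16, 24, 246)
```

Because the Chernoff inequality holds for every t ≥ 0, `math.exp(log_bound)` remains a valid upper bound for the given p even if the Newton iteration is stopped early; convergence only makes the bound as small as possible.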

4. Summary 

For the case of very large sample sizes, with data quite inconsistent with the 
hypothesis that two binomial distributions have the same value of p, we anticipate 
very small P-values. The usual calculations are unreliable because large deviation 
effects make the asymptotic normality on which these calculations depend 
unreliable. While it is clear in such cases that the hypothesis is false, it is often desirable 
to have a conservative bound on the P-value. The Chernoff bound provides such a 
result. We provide the basis for three algorithms. One provides the (p, d) values for 
which given bounds on the value of log(P) are attained assuming that a Poisson 
approximation to the binomial distribution is acceptable. This algorithm is used as 
a starting point in calculating the curve of (p, d) values for the binomial 
distribution. Finally we show how to calculate the conservative bound for the P-value in 
the binomial case. 


$$Q'(t) = \partial Q/\partial t + d'(t)\, \partial Q/\partial d = -t\, d'(t).$$


dnifi2(l — p) 

(m + r»2)p 
We have in Figure 1 the (p, d) values for the case $P = 10^{-a}$ where a takes 
on the values 3, 4, 5, 8, 12, 16, and 20, $n_1 = 19,479$, $n_2 = 285,422$, and we use the 
binomial distribution. In Figure 2 we use the calculation for the Yates continuity 
correction where p represents the estimate of the common probability. 

In both of these cases we have calculated one sided P-values. The calculation for 
negative values of D can be obtained by interchanging $n_1$ and $n_2$ after replacing D 
by its absolute value. 

Abstract: Consider an imaging situation with extremely high noise levels. Hidden 
in the noise there may or may not be a signal; the signal, when present, is 
so faint that it cannot be reliably detected from a single frame of imagery. 
Suppose now multiple frames of imagery are available. Within each frame, there is 
only one pixel possibly containing a signal while all other pixels contain purely 
Gaussian noise; in addition, the position of the signal moves around randomly 
from frame to frame. Our goal is to study how to reliably detect the existence 
of the signal by combining all different frames together, or by “multiple looks”. 

In other words, we are considering the following testing problem; test 
whether all normal means are zeros versus the alternative that one normal 
mean per frame is non-zero. We identified an interesting range of cases in 
which either the number of frames or the contrast size of the signal is not large 
enough, so that the alternative hypothesis exhibits little noticeable effect on 
the bulk of the tests or for the few most highly significant tests. With care- 
ful calibration, we carried out detailed study of the log-likelihood ratio for a 
precisely-specified alternative. We found that there is a threshold effect for the 
above detection problem: for a given amplitude of the signal, there is a critical 
value for the number of frames — the detection boundary — above which it is 
possible to detect the presence of the signals, and below which it is impos- 
sible to reliably detect it. The detection boundary is explicitly specified and 
graphed. 

In addition, we show that above the detection boundary, the likelihood 
ratio test would succeed by simply accepting the alternative when the 
log-likelihood ratio exceeds 0. We also show that the newly proposed Higher 
Criticism statistic in [11] is successful throughout the same region of number (of 
frames) vs. amplitude where the likelihood ratio test would succeed. Since 
Higher Criticism does not require a specification of the alternative, this implies 
that Higher Criticism is in a sense optimally adaptive for the above detection 
problem. The phenomenon found for the Gaussian setting above also exists for 
several non-Gaussian settings. 


1. Introduction 

Consider a situation in which many extremely noisy images are available. In each 
image frame, there is only one pixel containing a signal with all other pixels con- 
taining purely Gaussian noise. For any single frame, the signal is so faint that it 
is impossible to detect, and in addition, the position of the signal moves around 
randomly from frame to frame. The goal is to study how to detect a signal hidden 
in the extremely noisy background by combining all different frames together; i.e. 
by “multiple looks”. This is a mathematical caricature of situations faced in two 
applied problems. 

1. Speckle Astronomy. In earth-based telescope imaging of astronomical objects, 
atmospheric turbulence poses a fundamental obstacle. The image of the object 
is constantly moving around in the field of view; with a regular exposure time, 
an image of what should be a sharp point becomes highly blurred. A possible 
approach is to take many pictures with very short exposure time for each 
picture; the exposure time is so short that during exposure the position of 
the object hardly changes. However, this causes a new problem: the exposure 
time being so short that few photons accumulate, therefore we are unable to 
clearly see the object in any single frame. Technology nowadays enables us to 
easily collect hundreds or thousands of frames of pictures; from one frame to 
another, the position of the galaxy/star (if it exists) randomly moves around 
within the frame. The goal is to find out roughly at what amplitude it becomes 
possible to tell, from m realizations, that there is something present above 
usual background, see [2]. In this example, we are trying to detect, but not 
to estimate. 

2. Single Particle Electron Microscopy (SPEM). In traditional crystallography, 
the image taken is actually the superposition of the scattering intensity across 
a huge number (10**) of fundamental cells of the crystal; the superposed 
image lacks phase, and can only resolve the modulus of the Fourier Transform 
(FT) of the image. However we need to see images with phase correctly 
resolved. A possible solution to this is the single particle EM, see [25]. This 
method enables us to see a correctly phased image from a single surface patch 
of frozen non-crystallized specimen; however this causes a new problem: the 
image is extremely noisy, and there is little chance to see the molecule from any 
single image. On the other hand, technology nowadays can easily take large 
numbers (10*®) of different frames of image; however from one frame to 
another, the position of the molecule randomly moves around the whole frame. 
However, by combining these huge numbers of frames of images, we hope we 
can reliably estimate the shape of the molecule. The question here is: what 
are the fundamental limits of resolution? If we can’t “see” the molecule in 
any one image, and the molecule is moving around, can we still recover the 
image? In this example, the question is to estimate; however the first step for 
estimation is to make sure the things you want to estimate are actually there, 
and so the problem of detection is an essential first step. 

1.1. The multiple-looks model 

Motivated by the examples in the previous section, suppose that we have 
independent observations $X_j^{(k)}$, $1 \le j \le n$, $1 \le k \le m$ (we reserve $i$ for $\sqrt{-1}$); here j is the 
index for different pixels in each frame, and k is the index for different frames. As 
we have m frames and n pixels per frame, we have in total N observations, where 

$$N \equiv m \cdot n. \qquad (1.1)$$

For simplicity, assume that the signal, if it exists, is contained in one pixel for 
each frame. We want to tell which of the following two cases is true: whether each 
frame contains purely Gaussian noise, or that exactly one pixel per frame contains 
a signal (of fixed amplitude) but all other pixels are purely Gaussian noise and that 
the position of the signal randomly changes from frame to frame. 

Formally, the observations obey: 

$$X_j^{(k)} = \mu\, \delta_{j_0(k)}(j) + z_j^{(k)}, \qquad 1 \le j \le n, \; 1 \le k \le m, \qquad (1.2)$$

where 

$$z_j^{(k)} \overset{\text{i.i.d.}}{\sim} N(0, 1),$$

$\mu$ is the amplitude of the signal, and $j_0(k)$ is the position of the signal. Here for 
any fixed k, $j_0(k)$ is a random variable taking values in {1, 2, ..., n} with equal 
probability, independent of each other as well as of $z_j^{(k)}$, and where $\delta_{j_0(k)}(\cdot)$ is the Dirac 
sequence: 

$$\delta_{j_0(k)}(j) = \begin{cases} 1, & j = j_0(k), \\ 0, & j \ne j_0(k). \end{cases} \qquad (1.3)$$

The problem is to find out: given $\mu$ and n, what is the minimum value of $m = m^*$ 
such that we are able reliably to distinguish (1.2) from the pure noise model $X_j^{(k)} = z_j^{(k)}$? 


Translating our problem into precise terms, we are trying to test the 
following hypotheses: 

$$H_0: \; X_j^{(k)} = z_j^{(k)}, \qquad 1 \le j \le n, \; 1 \le k \le m, \qquad (1.4)$$

$$H_1^{(n,m)}: \; X_j^{(k)} = \mu\, \delta_{j_0(k)}(j) + z_j^{(k)}, \qquad 1 \le j \le n, \; 1 \le k \le m; \qquad (1.5)$$

we call this testing model the multiple-looks model. Here, $H_0$ denotes the global 
intersection null hypothesis, and $H_1^{(n,m)}$ denotes a specific element in its 
complement. Under $H_1^{(n,m)}$, for each fixed k, there is only one observation among 
$\{X_j^{(k)}\}_{j=1}^{n}$ containing a signal with amplitude $\mu$, and the index $j_0(k)$ is sampled 
from the set {1, 2, ..., n} with equal probability, independently of k as well as of 
$z_j^{(k)}$; in total, we have N observations which are normally distributed with zero 
mean, except that m of them have a common nonzero mean $\mu$. 

Suppose we let $m = n^r$ for some exponent 0 < r < 1 (or equivalently $m = 
N^{r/(1+r)}$). For r in this range, the number of nonzero means is too small to be 
noticeable in any sum which is in expectation of order N, so it cannot noticeably 
affect the behavior of the bulk of the distribution. Let 

$$\mu = \mu_{n,s} = \sqrt{2 s \log n}, \qquad 0 < s < 1; \qquad (1.6)$$

for s in this range, $\mu_n < \sqrt{2 \log n}$, so the nonzero means are, in expectation, smaller 
than the largest $X_j^{(k)}$ from the true null component hypotheses, and the nonzero 
means cannot have a visible effect on the upper extremes. For the calibrations we 
choose in this way, there is only a tiny fraction of observations with elevated mean, 
and the elevated mean is only of moderate significance. 


1.2. Log-likelihood ratio and limit law 

Obviously, with $\mu$, n, and m fixed and known, the optimal procedure is the 
Neyman-Pearson likelihood ratio test (LRT), [28]. The log-likelihood ratio statistic for 
problems (1.4)-(1.5) is: 

$$LR_{n,m} = \sum_{k=1}^{m} LR_n^{(k)},$$

where for any $1 \le k \le m$, 

For fixed 0 < s < 1 and n large, when $r \approx 0$ is relatively small, as the overall evidence 
for the existence of the signal is very weak, the null hypothesis and the alternative 
hypothesis merge together, and it is not possible to separate them; but when r gets 
larger, say $r \approx 1$, the evidence for the existence of the signal will get strong enough 
so that the null and the alternative separate from each other completely. Between 
the stage of “not separable” and “completely separable”, there is a critical stage of 
“partly separable”; a careful study of this critical stage is the key for studying the 
problem of hypothesis testing (1.4)-(1.5). 

In terms of the log-likelihood ratio (LR), this particular critical stage of “partly 
separable” can be interpreted as follows: for any fixed s and $\mu_{n,s} = \sqrt{2s \log n}$, there is a 
critical number $m^* = m^*(n, s)$ such that as $n \to \infty$, $LR_{n,m^*}$ converges weakly to 
non-degenerate distributions $\nu_s^0$ and $\nu_s^1$ under the null and the alternative 
respectively; since typically $\nu_s^0$ and $\nu_s^1$ overlap, the null and the alternative are partly 
separable. 

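This turns out to be true. In fact, we have the following theorem: 

Theorem 1.1. For parameter 0 < s < 1, let $\mu_n = \mu_{n,s} = \sqrt{2s \log n}$, and 

$$m^* = m^*(n, s) = \begin{cases} n^{1-2s}, & 0 < s \le 1/3, \\ \sqrt{2\pi} \cdot \mu_{n,s} \cdot n^{(1-s)^2/(4s)}, & 1/3 < s < 1, \end{cases}$$

then as $n \to \infty$: 

1. When 0 < s < 1/3, 

$$\text{under } H_0: \; LR_{n,m^*} \Rightarrow N(-1/2, 1), \qquad \text{under } H_1^{(n,m^*)}: \; LR_{n,m^*} \Rightarrow N(1/2, 1).$$

2. When s = 1/3, 

$$\text{under } H_0: \; LR_{n,m^*} \Rightarrow N(-1/4, 1/2), \qquad \text{under } H_1^{(n,m^*)}: \; LR_{n,m^*} \Rightarrow N(1/4, 1/2).$$

3. When 1/3 < s < 1, 

$$\text{under } H_0: \; LR_{n,m^*} \Rightarrow \nu_s^0, \qquad \text{under } H_1^{(n,m^*)}: \; LR_{n,m^*} \Rightarrow \nu_s^1,$$

where $\nu_s^0$ and $\nu_s^1$ are distributions with characteristic functions $e^{\psi_s^0}$ and $e^{\psi_s^1}$ 
respectively, and 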

/ OO 

[ginog(i+e*) _ i-ite‘]e--^^dz, (1.7) 

■OO 

/ OO 

[e“‘”g(i+"')-l]e--^^*d^. (1.8) 

■OO 


In fact, the difference between $LR_{n,m^*}$ under $H_1^{(n,m^*)}$ and $LR_{n,m^*}$ under $H_0$ 
weakly converges to 1, 1/2, and $\nu^*$ according to s < 1/3, s = 1/3 and s > 1/3; here 
$\nu^*$ is the distribution with characteristic function 

It was shown in [26, Chapter 2] that the laws $\nu_s^0$ and $\nu_s^1$ in Theorem 1.1 are 
in fact infinitely divisible. In Section 6.3, we discuss several other issues about 
$\nu_s^0$ and $\nu_s^1$, where we view $\nu_s^0$ as a special example of $\nu_{s,\gamma}^0$, and $\nu_s^1$ as a special example 
of $\nu_{s,\gamma}^1$, with $\gamma = 2$. In short, both $\nu_s^0$ and $\nu_s^1$ have a bounded continuous density 
function, and a finite first moment as well as a finite second moment. The mean value 
of $\nu_s^0$ is negative, and the mean value of $\nu_s^1$ is positive; in comparison, $\nu_s^0$ has a 
smaller variance than $\nu_s^1$. In Figure 1, we plot the characteristic functions and 
density functions for $\nu_s^0$ and $\nu_s^1$ respectively with s = 1/2. 

Figure 1: Left panel: Characteristic functions for $\nu_s^0$ (top) and $\nu_s^1$ (bottom). Left 
column: real parts, right column: imaginary parts. Right panel: Density functions 
for $\nu_s^0$ (left) and $\nu_s^1$ (right). The mean values of them are approximately -2.09 and 
4.19, and the variances of them are approximately 2.57 and 20 respectively. 

In [8], adapting to our notations, Burnashev and Begmatov studied the limiting 
behavior of $LR_{n,m}$ with m = 1; see more discussion in Section 7.3, as well as 
Section 4. In addition, the LRT and its optimality have been widely studied, see [6, 
14], etc., and have also been discussed for various settings of detection of signals in 
a Gaussian noise setting, see [3, 4, 13], and also [29] for example. 


1.3. Detection boundary 


Theorem 1.1 implies that there is a threshold effect for the detection problem of 
(1.4)-(1.5). Dropping some lower order terms when necessary (namely $\sqrt{2\pi} \cdot \mu_{n,s}$ 
in the case 1/3 < s < 1), $m^*$ would be reduced into a clean form: $m^* = n^{\rho^*(s)}$, 
where 

$$\rho^*(s) = \begin{cases} 1 - 2s, & 0 < s \le 1/3, \\ \dfrac{(1-s)^2}{4s}, & 1/3 < s < 1. \end{cases} \qquad (1.9)$$
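As a small illustration of (1.9), the sketch below evaluates the boundary and classifies a calibration (s, r) as detectable or not. The function names `rho_star` and `is_detectable` are introduced here only for the example, and assigning s = 1/3 to the linear branch is an arbitrary choice, since the two branches agree there.

```python
def rho_star(s):
    """Detection boundary rho*(s) of (1.9), for 0 < s < 1."""
    if not 0.0 < s < 1.0:
        raise ValueError("s must lie strictly between 0 and 1")
    if s <= 1.0 / 3.0:
        return 1.0 - 2.0 * s          # linear part
    return (1.0 - s) ** 2 / (4.0 * s)  # curved part


def is_detectable(s, r):
    """True when m = n**r frames lie above the boundary for amplitude index s."""
    return r > rho_star(s)
```

For instance, `rho_star(0.5)` evaluates to 0.125, so with s = 1/2 any exponent r above 1/8 falls in the detectable region. The two branches agree in value (1/3) and in slope (-2) at s = 1/3.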


Consider the curve $r = \rho^*(s)$ in the s-r plane. The curve separates the square 
{(s, r) : 0 < s < 1, 0 < r < 1} into two regions: the region above the curve, or the 
detectable region, and the region below the curve, or the undetectable region; we 
call $r = \rho^*(s)$ the detection boundary. See the left panel of Figure 4 for illustrations; 
also see the left panel of Figure 5, where the curve corresponding to $\gamma = 2$ is $r = \rho^*(s)$. 
Theorem 1.1 implies that, roughly speaking, $LR_{n,m}$ weakly converges to different 
non-degenerate distributions when (s, r) falls exactly on the detection boundary. We 
now study what will happen when (s, r) moves away from the detection boundary. 
On one hand, when (s, r) moves towards the interior of the detectable region, in 
comparison, we will have a lot more available observations while at the same time 
the amplitude is the same; so intuitively, $LR_{n,m}$ will put almost all mass at $-\infty$ 
under the null, and at $\infty$ under the alternative; this implies that the null and 
alternative separate from each other completely. On the other hand, when (s, r) moves 
towards the interior of the undetectable region, conversely, we have much fewer 
observations than we need, so the null and the alternative would both concentrate their 
mass around 0; more subtle analysis in Section 4 gives a much stronger claim: by 
appropriate normalization, $LR_{n,m}$ weakly converges to the same non-degenerate 
distribution under $H_0$ as well as under $H_1^{(n,m)}$, and this non-degenerate 
distribution has a bounded continuous density function; thus the null and the alternative do 
completely merge together and are not separable. Precisely, we have the following 
Theorem. Recall that the Kolmogorov-Smirnov distance $\|\cdot\|_{KS}$ between any two 
cdf’s G and G' is defined as: 

$$\|G - G'\|_{KS} = \sup_t |G(t) - G'(t)|;$$

back to our notation $m = n^r$, here m depends only on n and r, which is not the 
critical $m^* = m^*(n, s)$ as in Theorem 1.1. 

Theorem 1.2. Let $\mu_n = \mu_{n,s} = \sqrt{2s \log n}$ and $m = n^r$. 

1. When $r > \rho^*(s)$, consider the likelihood ratio test (LRT) that rejects $H_0$ when 
$LR_{n,m} > 0$; the sum of Type I and Type II errors tends to 0: 

$$P_{H_0}\{\text{Reject } H_0\} + P_{H_1^{(n,m)}}\{\text{Accept } H_0\} \to 0, \qquad n \to \infty.$$

2. When $r < \rho^*(s)$, 

$$\lim_{n \to \infty} \left\|F_0^{(n,m)} - F_1^{(n,m)}\right\|_{KS} = 0,$$

where $F_0^{(n,m)}$ and $F_1^{(n,m)}$ are the cdf’s of $LR_{n,m}$ under $H_0$ and $H_1^{(n,m)}$ 
respectively. As a result, for any test procedure, the sum of Type I and Type II 
errors tends to 1: 

$$P_{H_0}\{\text{Reject } H_0\} + P_{H_1^{(n,m)}}\{\text{Accept } H_0\} \to 1, \qquad n \to \infty.$$

1.4. Higher criticism and optimal adaptivity 

If we think of the s-r plane, 0 < s < 1, 0 < r < 1, we are saying that throughout 
the region $r > \rho^*(s)$, the alternative can be detected reliably using the likelihood 
ratio test (LRT). Unfortunately, as discussed in [11], the usual (Neyman-Pearson) 
likelihood ratio requires a precise specification of s and r, and misspecification of 
(s, r) may lead to failure of the LRT. Naturally, in any practical situation we would 
like to have a procedure which does well throughout this whole region without 
knowledge of (s, r). Hartigan [18] and Bickel and Chernoff [7] have shown that the 
usual generalized likelihood ratio test $\max_{\varepsilon,\mu}\{[dP_1^{(n)}(\varepsilon,\mu)/dP_0^{(n)}](X)\}$ has nonstandard 
behavior in this setting: in fact the maximized ratio tends to $\infty$ under $H_0$. It 
is not clear that this test can be relied on to detect subtle departures from $H_0$. 
Ingster [21] has proposed an alternative method of adaptive detection which maximizes 
the likelihood ratio over a finite but growing list of simple alternative hypotheses. 
By careful asymptotic analysis, he has in principle completely solved the problem 
of adaptive detection in the Gaussian mixture model (2.2)-(2.3) which we will 
introduce in Section 2; however, this is a relatively complex and delicate procedure 
which is tightly tied to the narrowly-specified Gaussian mixture model (2.2)-(2.3). 
It would be nice to have an easily-implemented and intuitive method of detection 
which is able to work effectively throughout the whole region 0 < s < 1, $r > \rho^*(s)$, 
which is not tied to the narrow model (2.2)-(2.3), and which is in some sense 
easily adapted to other (non-Gaussian) mixture models. Motivated by these, we have 
developed a new statistic, Higher Criticism, in [11], where we have shown that the 
Higher Criticism statistic is optimally adaptive for detecting sparse Gaussian 
heterogeneous mixtures, as well as many other non-Gaussian settings. 

To apply the Higher Criticism in our situation, let us convert the observations 
into p-values. Let $p_j^{(k)} = P\{N(0,1) > X_j^{(k)}\}$ be the p-value for observation $X_j^{(k)}$, 
and let the $p_{(i)}$ denote the p-values sorted in increasing order (recall $N = n \cdot m$): 

$$p_{(1)} < p_{(2)} < \cdots < p_{(N)},$$

so that under the intersection null hypothesis the $p_{(i)}$ behave like order statistics 
from a uniform distribution. With this notation, the Higher Criticism is: 

$$HC_N^* = \max_{1 \le i \le \alpha_0 \cdot N} \sqrt{N}\, \frac{i/N - p_{(i)}}{\sqrt{p_{(i)}(1 - p_{(i)})}},$$

where $0 < \alpha_0 < 1$ is any constant. Under the null hypothesis $H_0$, $HC_N^*$ is related to 
the normalized uniform empirical process. Intuitively, under $H_0$, the p-values $p_j^{(k)}$ 
can be viewed as independent samples from U(0, 1). Adapting to the notations of 
[11], let $F_N(t) = \frac{1}{N} \sum_{j,k} 1_{\{p_j^{(k)} \le t\}}$; the uniform empirical process is denoted 
by 

$$U_N(t) = \sqrt{N}\,[F_N(t) - t], \qquad 0 < t < 1,$$

and the normalized uniform empirical process by 

$$W_N(t) = \frac{U_N(t)}{\sqrt{t(1-t)}}.$$

Under $H_0$, for each fixed t, $W_N(t)$ is asymptotically N(0, 1), and 

$$HC_N^* = \max_{0 < t < \alpha_0} W_N(t).$$

See [11] for more discussion. The following theorem is proved in [11]; 

Theorem 1.3. Under the null hypothesis Ho, as N —> X, 

HCs . 

\/2 log log N ’’ 

It then follows if we threshold HC’f^ at v/-lloglogA', the Type I error would etpial 
to 0 asymptotically; moreover, thresholding at \/\ log log N also gives a Tyj)e 11 
error which equals to 0 asymptotically; 

Theorem 1.4. Consider the Higher Criticism test that rejects $H_0$ when

$$HC_N^* > \sqrt{4\log\log N}. \tag{1.10}$$

For every alternative $H_1^{(n,m)}$ defined in (1.4)-(1.5) above where $r$ exceeds the
detection boundary $\rho^*(s)$ (so that the likelihood ratio test rejecting $H_0$ at threshold 0
would have negligible sum of Type I and Type II errors), the test based on Higher Criticism
in (1.10) also has negligible sum of Type I and Type II errors:

$$P_{H_0}\{\text{Reject } H_0\} + P_{H_1^{(n,m)}}\{\text{Accept } H_0\} \to 0, \qquad n \to \infty.$$



Roughly speaking, everywhere in the s-r plane where the likelihood ratio test
would completely separate the two hypotheses asymptotically, the Higher Criticism
will also completely separate the two hypotheses asymptotically; since it
does not require any specification of the parameters $s$ and $r$, the Higher Criticism
statistic is in some sense optimally adaptive. Of course, in the cases where the pair
$(s, r)$ falls below the detection boundary, all methods fail.
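The Higher Criticism test described above can be sketched in a few lines of code. The following is a minimal illustration, not the implementation used in [11]: the function names `normal_sf`, `higher_criticism` and `hc_test` and the default `alpha0 = 0.5` are assumptions made for this example, and it takes the statistic to be the maximum of $\sqrt{N}\,[i/N - p_{(i)}]/\sqrt{p_{(i)}(1-p_{(i)})}$ over $1 \le i \le \alpha_0 N$, with rejection threshold $\sqrt{4\log\log N}$ as in (1.10).

```python
import math


def normal_sf(x):
    """Upper-tail probability P{N(0,1) > x}."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def higher_criticism(observations, alpha0=0.5):
    """Higher Criticism statistic computed from one-sided normal p-values.

    alpha0 is an illustrative default; any constant in (0, 1) is allowed.
    """
    N = len(observations)
    pvalues = sorted(normal_sf(x) for x in observations)
    best = -math.inf
    for i in range(1, int(alpha0 * N) + 1):
        p = pvalues[i - 1]
        if 0.0 < p < 1.0:  # skip degenerate p-values to avoid dividing by zero
            value = math.sqrt(N) * (i / N - p) / math.sqrt(p * (1.0 - p))
            best = max(best, value)
    return best


def hc_test(observations, alpha0=0.5):
    """Return True when H0 is rejected at the threshold sqrt(4 log log N)."""
    N = len(observations)
    threshold = math.sqrt(4.0 * math.log(math.log(N)))
    return higher_criticism(observations, alpha0) > threshold
```

Only the pooled p-values enter the computation, so the frame structure of the data and the parameters $s$ and $r$ are never used; this is the sense in which the procedure needs no specification of parameters.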

It is interesting to notice here the phenomenon that the detection boundary
$r = \rho^*(s)$ is partly linear ($s < 1/3$) and partly curved ($s > 1/3$); the curve only has
up to the first order continuous derivatives at $s = 1/3$. As discussed in [11] or [26,
Chapters 2-5], this phenomenon implies that the detection problem of (1.4)-(1.5) is
essentially different for the cases $0 < s < 1/3$ and $1/3 < s < 1$. Intuitively, when
$(s, r)$ is close to the curved part, statistics based on a few of the largest observations
are able to detect effectively, while when $(s, r)$ is close to the linear part,
statistics based on a few of the largest observations (such as Max, Bonferroni, FDR) will
fail, and only the newly proposed Higher Criticism statistic, or the Berk-Jones
statistic, which is asymptotically equivalent to the Higher Criticism in some sense
[5, 11], is able to detect efficiently. As the study is similar to that in [11], we skip
further discussion. However, in Section 2.2, we will explain this phenomenon from
the angle of analysis.

1.5. Summary 

We have considered a setting in which we have multiple frames of extremely noisy
images. In each frame, hidden in the noise, there may or may not be a signal;
the signal, when present, is too faint to be reliably detected from a single
frame, and its position moves randomly across the whole frame. For
fixed contrast size of the signal and number of pixels in each frame, there is
a critical number of frames (the detection boundary) above which combining all
frames together gives full-power detection of the existence of the signal, and
below which it is impossible to detect.

Above the detection boundary, the Neyman-Pearson LRT gives full-power
detection. However, implementing the LRT requires a specification of the parameters,
and misspecification of the parameters may lead to the failure of the LRT. Motivated
by this, we proposed a non-parametric statistic, Higher Criticism, in [11], which
does not require such a specification of parameters; the Higher Criticism statistic
gives detection power asymptotically equal to that of the LRT. The Higher Criticism
statistic depends only on p-values and can be used in many other settings.

Moreover, the detection boundary is partly linear and partly curved; comparing
the case when the parameters are near the curved part with the case when they
are near the linear part, the detection problem is essentially different.
Asymptotically, in the first case, statistics based on a few of the largest
observations are able to detect efficiently; however, in the second case, such
statistics will totally fail, but the Higher Criticism statistic is still able to
detect efficiently.

Below the detection boundary, asymptotically, all tests will completely fail for 
detection, even when all parameters are known. 

The approach developed here seems applicable to a wide range of non-Gaussian
noise settings. In Section 6, we extend the Gaussian noise setting to the
Generalized Gaussian noise setting.

1.6. Organization

The remaining part of the paper is organized as follows.

Sections 2-3 are for the proof of Theorem 1.1. In Section 2, we introduce a
Gaussian mixture model, which we expect to be an "approximation" of the
multiple-looks model, or Model (1.4)-(1.5); in comparison, this Gaussian mixture model
is easier to study, and thus provides a bridge for studying the multiple-looks model. We
then validate this expectation in Section 3 by showing that, with carefully chosen
parameters, the difference between the log-likelihood ratios of these two models is
indeed negligible; Theorem 1.1 is the direct result of those studies in Sections 2-3.

Second, we prove Theorem 1.2 in Section 4, and Theorem 1.4 in Section 5.

Next, in Section 6, we extend the study in Section 2 on the Gaussian mixture
to non-Gaussian settings.

Finally, in Section 7, we briefly discuss several issues related to this paper.
Section 8 is a technical Appendix.


2. Gaussian mixture model, and its connection to multiple looks model 


Model (1.4)-(1.5) can be approximately translated into a Gaussian mixture model
by "random shuffling". In fact, recall that the observations $X_j^{(k)}$ are collected
frame by frame; suppose we arrange the $X_j^{(k)}$ in a row according to the natural
ordering:

$$X_1^{(1)}, X_2^{(1)}, \ldots, X_n^{(1)}, \; \ldots, \; X_1^{(m)}, X_2^{(m)}, \ldots, X_n^{(m)};$$

we then randomly shuffle them and rearrange them back into frames, according to the
ordering after the shuffling; we denote the resulting observations by
$\{\tilde X_j^{(k)} : 1 \le j \le n, \; 1 \le k \le m\}$.

Of course under $H_0$, the above random shuffling will not have any effect and the
joint distribution of $\{\tilde X_j^{(k)}\}$ is the same as that of $\{X_j^{(k)}\}$. However, if
$H_1^{(n,m)}$ is true, then $\tilde X_j^{(k)}$ will have a slightly different distribution than
that of $X_j^{(k)}$, which, approximately, can be viewed as sampled from a Gaussian mixture:

$$\tilde X_j^{(k)} \sim (1 - \epsilon) N(0,1) + \epsilon N(\mu, 1), \qquad 1 \le j \le n, \; 1 \le k \le m, \tag{2.1}$$

with

$$\epsilon = \epsilon_n = n^{-1}, \qquad \mu = \mu_n = \mu_{n,s} = \sqrt{2 s \log n}.$$

The difference between $\{X_j^{(k)}\}$ and $\{\tilde X_j^{(k)}\}$ is that under $H_1^{(n,m)}$, $\{X_j^{(k)}\}$ has
exactly a fraction $1/n$ of nonzero means in each frame, while $\{\tilde X_j^{(k)}\}$ has such a
fraction only in expectation. Moreover, the problem of hypothesis testing the multiple-looks
model (1.4)-(1.5) is approximately equivalent to the hypothesis testing:

$$H_0: \; X_j^{(k)} \stackrel{iid}{\sim} N(0,1), \qquad 1 \le j \le n, \; 1 \le k \le m, \tag{2.2}$$

$$\mathcal{H}_1^{(n,m)}: \; X_j^{(k)} \stackrel{iid}{\sim} (1 - 1/n) N(0,1) + (1/n) N(\mu_n, 1), \qquad 1 \le j \le n, \; 1 \le k \le m. \tag{2.3}$$

In this paper, we refer to this model as the Gaussian mixture model, in contrast
to the multiple-looks model. Since the random shuffling has no effect on the null
hypothesis, we still use $H_0$ to denote the null hypothesis; however, we use $\mathcal{H}_1^{(n,m)}$
to denote the new alternative hypothesis. Moreover, we denote the log-likelihood ratio
statistic of Model (2.2)-(2.3) by $\mathcal{LR}_{n,m}$, in contrast to $LR_{n,m}$ of Model (1.4)-(1.5).
Notice here:

$$\mathcal{LR}_{n,m} = \mathcal{LR}_{n,m}\big(X_1^{(1)}, X_2^{(1)}, \ldots, X_n^{(m)}\big) = \sum_{k=1}^{m} \sum_{j=1}^{n} \mathcal{LR}_j^{(k)},$$

where

$$\mathcal{LR}_j^{(k)} = \mathcal{LR}\big(\mu_n, n; X_j^{(k)}\big) = \log\Big(1 - \frac{1}{n} + \frac{1}{n} e^{\mu_n X_j^{(k)} - \mu_n^2/2}\Big).$$
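To make the random shuffling and the mixture statistic concrete, here is a small simulation sketch. It assumes the multiple-looks alternative places one pixel with mean $\mu_n = \sqrt{2 s \log n}$ at a uniformly random position in each frame, and that the mixture log-likelihood ratio sums $\log(1 - 1/n + (1/n) e^{\mu_n x - \mu_n^2/2})$ over all observations $x$; the function names are hypothetical and chosen only for this illustration.

```python
import math
import random


def simulate_multiple_looks(n, m, s, rng):
    """Draw m frames of n pixels; each frame hides one pixel with mean mu_n."""
    mu = math.sqrt(2.0 * s * math.log(n))
    frames = []
    for _ in range(m):
        frame = [rng.gauss(0.0, 1.0) for _ in range(n)]
        frame[rng.randrange(n)] += mu  # signal position is uniform over the frame
        frames.append(frame)
    return frames, mu


def random_shuffle_frames(frames, rng):
    """Pool all observations, shuffle them, and cut them back into frames."""
    n = len(frames[0])
    pooled = [x for frame in frames for x in frame]
    rng.shuffle(pooled)
    return [pooled[i:i + n] for i in range(0, len(pooled), n)]


def mixture_log_lr(frames, mu):
    """Log-likelihood ratio of the Gaussian mixture model, summed over all pixels."""
    n = len(frames[0])
    total = 0.0
    for frame in frames:
        for x in frame:
            total += math.log(1.0 - 1.0 / n + math.exp(mu * x - mu * mu / 2.0) / n)
    return total
```

Because the mixture statistic is a plain sum over observations, it takes the same value before and after shuffling (up to floating-point rounding), whereas the multiple-looks statistic depends on how the observations are grouped into frames.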

There are two important reasons for introducing the Gaussian mixture model
above. First, as the multiple-looks model can be converted into the Gaussian mixture
model by random shuffling, we expect that these two models are closely related.
In fact, compare the two log-likelihood ratios $LR_{n,m}$ and $\mathcal{LR}_{n,m}$: on one hand, as
we will see in Section 3, with particularly chosen parameters $(s, r)$, the difference
between $LR_{n,m}$ and $\mathcal{LR}_{n,m}$ is in fact negligible; on the other hand, clearly, $\mathcal{LR}_{n,m}$
has a much simpler form than $LR_{n,m}$, and thus it is much easier to analyze
than $LR_{n,m}$. In short, the study of the Gaussian mixture model will provide
an important bridge for studying the multiple-looks model.

The second important reason is that, the Gaussian mixture model itself is of 
importance and has many interesting applications. In [11], we mentioned three ap- 
plication areas where situations as in Model (2.2)-(2.3) might arise: early detection 
of bio-weapons use, detection of covert communications, and meta-analysis with het- 
erogeneity. There are many other potential applications in signal processing e.g., 
[22, 23, 24]. 

The main result on the problem of hypothesis testing the Gaussian mixture
model, or Model (2.2)-(2.3), is the following.

Theorem 2.1. For parameter $0 < s < 1$, let $\mu_n = \mu_{n,s} = \sqrt{2 s \log n}$ and

$$m^* = m^*(n,s) = \begin{cases} n^{1-2s}, & 0 < s \le 1/3, \\ \sqrt{2\pi} \cdot \mu_{n,s} \cdot n^{(1-s)^2/(4s)}, & 1/3 < s < 1, \end{cases}$$

then as $n \to \infty$,

1. When $0 < s < 1/3$,

$$\mathcal{LR}_{n,m^*} \xrightarrow{w} N(-1/2, 1) \text{ under } H_0, \qquad \mathcal{LR}_{n,m^*} \xrightarrow{w} N(1/2, 1) \text{ under } \mathcal{H}_1^{(n,m^*)}.$$

2. When $s = 1/3$,

$$\mathcal{LR}_{n,m^*} \xrightarrow{w} N(-1/4, 1/2) \text{ under } H_0, \qquad \mathcal{LR}_{n,m^*} \xrightarrow{w} N(1/4, 1/2) \text{ under } \mathcal{H}_1^{(n,m^*)}.$$

3. When $1/3 < s < 1$,

$$\mathcal{LR}_{n,m^*} \xrightarrow{w} \nu_s^0 \text{ under } H_0, \qquad \mathcal{LR}_{n,m^*} \xrightarrow{w} \nu_s^1 \text{ under } \mathcal{H}_1^{(n,m^*)},$$

where $\nu_s^0$ and $\nu_s^1$ are the same as in Theorem 1.1.

Similarly, there is a threshold effect for the hypothesis testing of the Gaussian
mixture model, and hence a detection boundary. In the s-r plane, the detection
boundary of the Gaussian mixture model is:

$$r = \rho^*(s),$$

which is exactly the same as that of the multiple-looks model; see more discussion
on the Gaussian mixture model in [11].
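As a quick numerical companion to the calibration above, the sketch below evaluates the critical number of frames and the boundary exponent. It assumes $m^* = n^{1-2s}$ for $0 < s \le 1/3$ and $m^* = \sqrt{2\pi}\,\mu_{n,s}\,n^{(1-s)^2/(4s)}$ for $1/3 < s < 1$, and reads the boundary off the power of $n$, that is $\rho^*(s) = 1 - 2s$ on the linear part and $(1-s)^2/(4s)$ on the curved part. The names `rho_star` and `m_star` are illustrative only.

```python
import math


def rho_star(s):
    """Exponent of n in the critical number of frames (log factors dropped)."""
    if not 0.0 < s < 1.0:
        raise ValueError("s must lie in (0, 1)")
    if s <= 1.0 / 3.0:
        return 1.0 - 2.0 * s            # linear part
    return (1.0 - s) ** 2 / (4.0 * s)   # curved part


def m_star(n, s):
    """Critical number of frames for n pixels per frame and signal parameter s."""
    mu = math.sqrt(2.0 * s * math.log(n))
    if s <= 1.0 / 3.0:
        return n ** (1.0 - 2.0 * s)
    return math.sqrt(2.0 * math.pi) * mu * n ** rho_star(s)
```

Both branches take the value $1/3$ and have slope $-2$ at $s = 1/3$, which matches the remark that the boundary has continuous derivatives only up to first order there.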

Ingster [20] studied a similar problem and noticed similar threshold phenomena;
see more discussion in Section 7.3. There are many other studies on the detection
of Gaussian mixtures using the LRT; see [9, 10] and [17], for example.


2.1. Proof of Theorem 2.1 

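For the proof of Theorem 2.1, the approach below is developed independently and
is different from that in [20]; the approach below is also generalized to the settings
of non-Gaussian mixtures, which we will discuss in Section 6.

Denote the density function of $N(0,1)$ by

$$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}. \tag{2.4}$$

To prove Theorem 2.1, we start with the following key lemma:

Lemma 2.1. With $\mu_n = \mu_{n,s}$ as defined in Theorem 2.1,

$$\int_{-\infty}^{\infty} \big[e^{it\log(1+e^z)} - 1 - it e^z\big] e^{-\frac{1+s}{2s} z} \phi(z/\mu_n)\, dz = \begin{cases} -\big(\frac{it + t^2}{2} + o(1)\big) \cdot \mu_n \cdot n^{(1-3s)^2/(4s)}, & 0 < s < 1/3, \\ -\big(\frac{it + t^2}{4} + o(1)\big) \cdot \mu_n, & s = 1/3, \\ \frac{1}{\sqrt{2\pi}} \big(\psi_s^0(t) + o(1)\big), & 1/3 < s < 1, \end{cases} \tag{2.5}$$

and

$$\int_{-\infty}^{\infty} \big[e^{it\log(1+e^z)} - 1\big] e^{-\frac{1-s}{2s} z} \phi(z/\mu_n)\, dz = \begin{cases} (it + o(1)) \cdot \mu_n \cdot n^{(1-3s)^2/(4s)}, & 0 < s < 1/3, \\ \big(\frac{it}{2} + o(1)\big) \cdot \mu_n, & s = 1/3, \\ \frac{1}{\sqrt{2\pi}} \big[\psi_s^1(t) - \psi_s^0(t)\big] + o(1), & 1/3 < s < 1, \end{cases} \tag{2.6}$$

where $\psi_s^0(t)$ and $\psi_s^1(t)$ are defined in Theorem 1.1.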

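Let $N^* = N^*(n,s) = n \cdot m^*(n,s)$; to prove Theorem 2.1, it is sufficient to show
that:

$$\text{under } H_0: \quad E e^{it \mathcal{LR}_j^{(k)}} = \begin{cases} 1 - (it + t^2 + o(1))/(2N^*), & 0 < s < 1/3, \\ 1 - (it + t^2 + o(1))/(4N^*), & s = 1/3, \\ 1 + (\psi_s^0(t) + o(1))/N^*, & 1/3 < s < 1, \end{cases} \tag{2.7}$$

and

$$\text{under } \mathcal{H}_1^{(n,m^*)}: \quad E e^{it \mathcal{LR}_j^{(k)}} = \begin{cases} 1 + (it - t^2 + o(1))/(2N^*), & 0 < s < 1/3, \\ 1 + (it - t^2 + o(1))/(4N^*), & s = 1/3, \\ 1 + (\psi_s^1(t) + o(1))/N^*, & 1/3 < s < 1; \end{cases} \tag{2.8}$$

in fact, by independence, $E e^{it \mathcal{LR}_{n,m^*}} = \big(E e^{it \mathcal{LR}_j^{(k)}}\big)^{N^*}$, so a direct result of
(2.7)-(2.8) is that as $n \to \infty$, we have the following point-wise convergences:

$$\text{under } H_0: \quad E e^{it \mathcal{LR}_{n,m^*}} \to \begin{cases} e^{-(it + t^2)/2}, & 0 < s < 1/3, \\ e^{-(it + t^2)/4}, & s = 1/3, \\ e^{\psi_s^0(t)}, & 1/3 < s < 1, \end{cases}$$

and

$$\text{under } \mathcal{H}_1^{(n,m^*)}: \quad E e^{it \mathcal{LR}_{n,m^*}} \to \begin{cases} e^{(it - t^2)/2}, & 0 < s < 1/3, \\ e^{(it - t^2)/4}, & s = 1/3, \\ e^{\psi_s^1(t)}, & 1/3 < s < 1, \end{cases}$$

and Theorem 2.1 follows.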

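We now show (2.7). Under $H_0$, notice that:

$$E e^{it \mathcal{LR}_j^{(k)}} = \int_{-\infty}^{\infty} e^{it \log(1 - 1/n + (1/n) e^{\mu_n z - \mu_n^2/2})} \phi(z)\, dz \tag{2.9}$$

$$= e^{it \log(1 - 1/n)} \cdot \int_{-\infty}^{\infty} e^{it \log(1 + \frac{1}{n-1} e^{\mu_n z - \mu_n^2/2})} \phi(z)\, dz; \tag{2.10}$$

rewrite:

$$\int_{-\infty}^{\infty} e^{it \log(1 + \frac{1}{n-1} e^{\mu_n z - \mu_n^2/2})} \phi(z)\, dz \tag{2.11}$$

$$= 1 + \frac{it}{n} + \int_{-\infty}^{\infty} \big[e^{it \log(1 + (1/n) e^{\mu_n z - \mu_n^2/2})} - 1 - it (1/n) e^{\mu_n z - \mu_n^2/2}\big] \phi(z)\, dz + O(1/n^2); \tag{2.12}$$

the key of the analysis is using the substitution $e^{z'} = (1/n) e^{\mu_n z - \mu_n^2/2}$:

$$\int_{-\infty}^{\infty} \big[e^{it \log(1 + (1/n) e^{\mu_n z - \mu_n^2/2})} - 1 - it (1/n) e^{\mu_n z - \mu_n^2/2}\big] \phi(z)\, dz \tag{2.13}$$

$$= \frac{1}{\mu_n \, n^{(1+s)^2/(4s)}} \int_{-\infty}^{\infty} \big[e^{it \log(1 + e^{z})} - 1 - it e^{z}\big] e^{-\frac{1+s}{2s} z} \phi(z/\mu_n)\, dz; \tag{2.14}$$

combining (2.9)-(2.14) with Lemma 2.1 gives (2.7).

The proof of (2.8) is similar. Under $\mathcal{H}_1^{(n,m^*)}$,

$$E e^{it \mathcal{LR}_j^{(k)}} = (1 - 1/n) \cdot \int_{-\infty}^{\infty} e^{it \log(1 - 1/n + (1/n) e^{\mu_n z - \mu_n^2/2})} \phi(z)\, dz \tag{2.15}$$

$$+ \; (1/n) \cdot \int_{-\infty}^{\infty} e^{it \log(1 - 1/n + (1/n) e^{\mu_n z - \mu_n^2/2})} \phi(z - \mu_n)\, dz, \tag{2.16}$$

the first term can be analyzed similarly as in the case under $H_0$; as for the second
term, similarly we have:

$$\int_{-\infty}^{\infty} e^{it \log(1 - 1/n + (1/n) e^{\mu_n z - \mu_n^2/2})} \phi(z - \mu_n)\, dz \tag{2.17}$$

$$= 1 + \int_{-\infty}^{\infty} \big[e^{it \log(1 + (1/n) e^{\mu_n z - \mu_n^2/2})} - 1\big] \phi(z - \mu_n)\, dz + O(1/n) \tag{2.18}$$

$$= 1 + \frac{1}{\mu_n \, n^{(1-s)^2/(4s)}} \int_{-\infty}^{\infty} \big[e^{it \log(1 + e^{z})} - 1\big] e^{-\frac{1-s}{2s} z} \phi(z/\mu_n)\, dz + O(1/n); \tag{2.19}$$

combining (2.15)-(2.19) with (2.9) and Lemma 2.1 gives (2.8).

This concludes the proof of Theorem 2.1. □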


2.2. Proof of Lemma 2.1 

As we mentioned before, an interesting phenomenon for the detection of the
multiple-looks model is that the detection boundary is partly linear and partly curved; the
whole curve only has up to the first order continuous derivatives. As the intuition
for why this phenomenon happens has been developed in [11], here we try to
understand the phenomenon from the angle of analysis.

In fact, take (2.5) for example: as $\mu_n \to \infty$, the integral

$$\int_{-\infty}^{\infty} \big[e^{it\log(1+e^z)} - 1 - it e^z\big] e^{-\frac{1+s}{2s} z} \phi(z/\mu_n)\, dz \tag{2.20}$$

behaves totally differently for the cases $0 < s < 1/3$ and $1/3 < s < 1$. The reason is
that, after dropping the term $\phi(z/\mu_n)$, the integrand in (2.20) is absolutely integrable
if and only if $(1+s)/(2s) < 2$, or equivalently $1/3 < s < 1$; to see this, notice that
the only possible place that could make the integral diverge is $z = -\infty$, and observe
that when $z < 0$ and $|z|$ is very large:

$$e^{it\log(1+e^z)} - 1 - it e^z \sim -\frac{it + t^2}{2}\, e^{2z}; \tag{2.21}$$

it immediately follows that the integral converges if and only if $(1+s)/(2s) < 2$,
or $1/3 < s < 1$.
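The tail behaviour behind this dichotomy is easy to check numerically. The sketch below is only an illustration under stated assumptions: it takes the bracketed term to be $e^{it\log(1+e^z)} - 1 - it e^z$ with leading behaviour $-\frac{it+t^2}{2} e^{2z}$ as $z \to -\infty$, and the weight to be $e^{-\frac{1+s}{2s} z}$, so that the integrand decays like $e^{az}$ with $a = 2 - (1+s)/(2s)$. The helper names are hypothetical.

```python
import cmath
import math


def remainder(t, z):
    """The bracketed term e^{it log(1+e^z)} - 1 - it e^z."""
    ez = math.exp(z)
    return cmath.exp(1j * t * math.log1p(ez)) - 1.0 - 1j * t * ez


def leading_term(t, z):
    """Claimed leading behaviour as z -> -infinity: -(it + t^2)/2 * e^{2z}."""
    return -(1j * t + t * t) / 2.0 * math.exp(2.0 * z)


def tail_exponent(s):
    """Rate a such that the integrand behaves like e^{a z} as z -> -infinity.

    The integral converges at -infinity exactly when a > 0, i.e. when s > 1/3.
    """
    return 2.0 - (1.0 + s) / (2.0 * s)
```

For example, `remainder(1.0, -8.0) / leading_term(1.0, -8.0)` is close to 1, and `tail_exponent(s)` changes sign exactly at $s = 1/3$.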

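As a result, when $1/3 < s < 1$, (2.5) follows easily by the Dominated Convergence
Theorem. In fact, recall the definition of $\psi_s^0$, and by noticing the point-wise
convergence of $\phi(z/\mu_n)$ to $1/\sqrt{2\pi}$, we have:

$$\int_{-\infty}^{\infty} \big[e^{it\log(1+e^z)} - 1 - it e^z\big] e^{-\frac{1+s}{2s} z} \phi(z/\mu_n)\, dz = \frac{1}{\sqrt{2\pi}} \big(\psi_s^0(t) + o(1)\big).$$

However, when $0 < s < 1/3$, the integral goes to $\infty$ as $\mu_n \to \infty$, so we need
to analyze it differently. In fact, using (2.21), we have:

$$\int_{-\infty}^{\infty} \big[e^{it\log(1+e^z)} - 1 - it e^z\big] e^{-\frac{1+s}{2s} z} \phi(z/\mu_n)\, dz = \int_{-\infty}^{0} \big[e^{it\log(1+e^z)} - 1 - it e^z\big] e^{-\frac{1+s}{2s} z} \phi(z/\mu_n)\, dz + O(1)$$

$$= -\frac{it + t^2}{2} \Big[\int_{-\infty}^{0} e^{(2 - \frac{1+s}{2s}) z} \phi(z/\mu_n)\, dz\Big] (1 + o(1)) + O(1)$$

$$= -\frac{it + t^2}{2} \cdot \mu_n \cdot n^{(1-3s)^2/(4s)} \cdot (1 + o(1)).$$

The remaining part of the proof is similar, so we skip it. See [26, Chapter 2] for a
more detailed proof. □


3. Proof of Theorem 1.1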


As we mentioned in Section 2, the multiple-looks model (1.4)-(1.5) can be converted
into the Gaussian mixture model (2.2)-(2.3) by random shuffling; we thus expect
the difference between the log-likelihood ratios $LR_{n,m^*}$ and $\mathcal{LR}_{n,m^*}$ to be negligible,
or

$$LR_{n,m^*} = \mathcal{LR}_{n,m^*} + o_p(1). \tag{3.1}$$

As a result, the limiting behavior of $LR_{n,m^*}$ would be asymptotically the same as
that of $\mathcal{LR}_{n,m^*}$ in Theorem 2.1.

Motivated by these, our approach for proving Theorem 1.1 is to first validate
(3.1), and then combine (3.1) with Theorem 2.1.

We now treat the cases under $H_0$ and under $H_1^{(n,m^*)}$ separately.

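First, under $H_0$. For $z_j^{(k)} \stackrel{iid}{\sim} N(0,1)$, $1 \le j \le n$, $1 \le k \le m^*$, let:

$$u^{(k)} = u_n^{(k)} = \frac{1}{n} \sum_{j=1}^{n} e^{\mu_n z_j^{(k)} - \mu_n^2/2}, \tag{3.2}$$

$$v^{(k)} = v_n^{(k)} = \prod_{j=1}^{n} \Big(1 - \frac{1}{n} + \frac{1}{n} e^{\mu_n z_j^{(k)} - \mu_n^2/2}\Big); \tag{3.3}$$

then under $H_0$, by symmetry:

$$LR_{n,m^*} = \sum_{k=1}^{m^*} \log(u^{(k)}), \qquad \mathcal{LR}_{n,m^*} = \sum_{k=1}^{m^*} \log(v^{(k)});$$

intuitively, since for a sequence of small numbers $a_j$, $\prod_{j=1}^{n} (1 + a_j) \approx 1 + \sum_{j=1}^{n} a_j$, so:

$$v^{(k)} \approx 1 + \sum_{j=1}^{n} \Big(-\frac{1}{n} + \frac{1}{n} e^{\mu_n z_j^{(k)} - \mu_n^2/2}\Big) = u^{(k)};$$

we thus expect that the difference between $LR_{n,m^*}$ and $\mathcal{LR}_{n,m^*}$ is indeed negligible.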
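The heuristic can be checked on a single simulated null frame. The sketch below is an illustration under stated assumptions rather than part of the proof: it takes the multiple-looks term of a frame to be the log of the average of the likelihood ratios $e^{\mu_n z_j - \mu_n^2/2}$, and the mixture term to be the sum of $\log(1 - 1/n + (1/n) e^{\mu_n z_j - \mu_n^2/2})$. The function name `frame_terms` is hypothetical.

```python
import math
import random


def frame_terms(frame, mu):
    """Return (log of the averaged likelihood ratio, log of the mixture product)
    for one frame of pure-noise observations."""
    n = len(frame)
    ratios = [math.exp(mu * z - mu * mu / 2.0) for z in frame]
    log_average = math.log(sum(ratios) / n)
    log_product = sum(math.log(1.0 - 1.0 / n + r / n) for r in ratios)
    return log_average, log_product


# Example: one frame with n = 2000 pixels and s = 0.1.
rng = random.Random(0)
n = 2000
mu = math.sqrt(2.0 * 0.1 * math.log(n))
frame = [rng.gauss(0.0, 1.0) for _ in range(n)]
log_average, log_product = frame_terms(frame, mu)
gap = log_product - log_average  # per-frame difference between the two statistics
```

The per-frame gap is small in this regime, and it can never fall below $n \log(1 - 1/n)$, since the product is at least $(1 - 1/n)^n$ times the average.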

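Let

$$w^{(k)} = \frac{v^{(k)}}{u^{(k)}} - 1, \tag{3.4}$$

then:

$$\mathcal{LR}_{n,m^*} - LR_{n,m^*} = \sum_{k=1}^{m^*} \log(1 + w^{(k)});$$

the following lemma validates the heuristic, or (3.1), under the null hypothesis $H_0$:

Lemma 3.1. Let $z_j^{(k)} \stackrel{iid}{\sim} N(0,1)$, $1 \le j \le n$, $1 \le k \le m^*$; then for $\mu_n = \sqrt{2 s \log(n)}$
and

$$m^* = \begin{cases} n^{1-2s}, & 0 < s \le 1/3, \\ \sqrt{2\pi} \cdot \mu_n \cdot n^{(1-s)^2/(4s)}, & 1/3 < s < 1, \end{cases}$$

we have:

$$\sum_{k=1}^{m^*} \log(1 + w^{(k)}) \to_p 0.$$

Combining Lemma 3.1 with Theorem 2.1 gives Theorem 1.1 under $H_0$.

Now under $H_1^{(n,m^*)}$, $X_j^{(k)} = \mu_n 1\{j = j_0(k)\} + z_j^{(k)}$, where the $j_0(k)$ are iid and
uniformly distributed over $\{1, 2, \ldots, n\}$; so by symmetry:

$$LR_{n,m^*} =_D \sum_{k=1}^{m^*} \log\Big(\frac{1}{n} \Big[e^{\mu_n z_1^{(k)} + \mu_n^2/2} + \sum_{j=2}^{n} e^{\mu_n z_j^{(k)} - \mu_n^2/2}\Big]\Big),$$

and we can rewrite:

$$LR_{n,m^*} = \Big[\sum_{k=1}^{m^*} \log\Big(\frac{1}{n} \sum_{j=2}^{n} e^{\mu_n z_j^{(k)} - \mu_n^2/2}\Big)\Big] + \Big[\sum_{k=1}^{m^*} \log\Big(1 + \frac{(1/n) e^{\mu_n z_1^{(k)} + \mu_n^2/2}}{(1/n) \sum_{j=2}^{n} e^{\mu_n z_j^{(k)} - \mu_n^2/2}}\Big)\Big]. \tag{3.5}$$

By the study for the case under $H_0$, the first term on the right hand side above
weakly converges to:

$$\sum_{k=1}^{m^*} \log\Big(\frac{1}{n} \sum_{j=2}^{n} e^{\mu_n z_j^{(k)} - \mu_n^2/2}\Big) \xrightarrow{w} \begin{cases} N(-1/2, 1), & 0 < s < 1/3, \\ N(-1/4, 1/2), & s = 1/3, \\ \nu_s^0, & 1/3 < s < 1, \end{cases} \tag{3.6}$$

with $\nu_s^0$ defined in Theorem 1.1, so all we need to study is the second term. The
following lemma is proved in [26, Chapter 4].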

Lemma 3.2. Fixed 0 < a < 5, with fi„ = #i„., = v/2sTog7t, then for 
X(0. 1), 1 < j<n, 

< a} < 2e“l'^**'^ e.. n‘' '’(i+o(i))l^ jj f. > j 

With some elementary analysis, Lemma 3.2 implies:


— ^ — > 1, in probability and in D', Vp > 0. 
e'*' 

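Now back to the second term on the right hand side of (3.5), or:

$$\sum_{k=1}^{m^*} \log\Big(1 + \frac{(1/n) e^{\mu_n z_1^{(k)} + \mu_n^2/2}}{(1/n) \sum_{j=2}^{n} e^{\mu_n z_j^{(k)} - \mu_n^2/2}}\Big); \tag{3.7}$$

inspired by (3.7), we expect that there will be only a negligible change if we replace
the messy term $(1/n) \sum_{j=2}^{n} e^{\mu_n z_j^{(k)} - \mu_n^2/2}$ by 1 for all $k$; this turns out to be true,
and we have the following lemma:

Lemma 3.3. For $\mu_n = \mu_{n,s}$ and $m^* = m^*(n,s)$ defined in Theorem 1.1, if
$z_j^{(k)} \stackrel{iid}{\sim} N(0,1)$, $1 \le j \le n$, $1 \le k \le m^*$, then:

$$\sum_{k=1}^{m^*} \Big[\log\Big(1 + \frac{(1/n) e^{\mu_n z_1^{(k)} + \mu_n^2/2}}{(1/n) \sum_{j=2}^{n} e^{\mu_n z_j^{(k)} - \mu_n^2/2}}\Big) - \log\Big(1 + \frac{1}{n} e^{\mu_n z_1^{(k)} + \mu_n^2/2}\Big)\Big] \to_p 0.$$

Applying Lemma 3.3 directly to (3.5) gives:

$$LR_{n,m^*} = \Big[\sum_{k=1}^{m^*} \log\Big(\frac{1}{n} \sum_{j=2}^{n} e^{\mu_n z_j^{(k)} - \mu_n^2/2}\Big)\Big] + \Big[\sum_{k=1}^{m^*} \log\Big(1 + \frac{1}{n} e^{\mu_n z_1^{(k)} + \mu_n^2/2}\Big)\Big] + o_p(1). \tag{3.8}$$

But for the second term in (3.8), observe that for any $t$, by the substitution
$e^{z'} = (1/n) e^{\mu_n z + \mu_n^2/2}$:

$$E e^{it \log(1 + (1/n) e^{\mu_n z + \mu_n^2/2})} = 1 + \frac{1}{\mu_n \, n^{(1-s)^2/(4s)}} \int_{-\infty}^{\infty} \big[e^{it\log(1+e^z)} - 1\big] e^{-\frac{1-s}{2s} z} \phi(z/\mu_n)\, dz;$$

by independence:

$$E e^{it \sum_{k=1}^{m^*} \log(1 + (1/n) e^{\mu_n z_1^{(k)} + \mu_n^2/2})} = \Big(E e^{it \log(1 + (1/n) e^{\mu_n z + \mu_n^2/2})}\Big)^{m^*},$$

and with Lemma 2.1 we then derive:

$$\sum_{k=1}^{m^*} \log\Big(1 + \frac{1}{n} e^{\mu_n z_1^{(k)} + \mu_n^2/2}\Big) \xrightarrow{w} \begin{cases} 1, & 0 < s < 1/3, \\ 1/2, & s = 1/3, \\ \nu_s^*, & 1/3 < s < 1, \end{cases} \tag{3.9}$$

where $\nu_s^*$ is the distribution with characteristic function $e^{\psi_s^1 - \psi_s^0}$; inserting (3.6)
and (3.9) into (3.8) gives the proof of Theorem 1.1 under $H_1^{(n,m^*)}$. □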


3. 1 . Proof of Lemma 3. 1 

A detailed proof of Lemma 3.1 is available in [26, Chapter 4]. In this section, we
will only illustrate the main ideas of the proof, while skipping the technical details.
Direct calculations show that:


1 + „,(*) > (1 _ i/„)" . Mlrdi ^ ^ > (1 - 1/n)", 

so when $n \ge 2$, there is a constant $C > 0$ such that:

$$|\log(1 + w^{(k)}) - w^{(k)}| \le C \cdot (w^{(k)})^2,$$

and to show Lemma 3.1, it is sufficient to show:


„«=) 


, 0 , 


(3.10) 


Split: 

ie<*) = «(*■-) + u<*' • - 1^ • l{„d.)>l/3) + - 1 j • 


• l(„(k)<l/3) + • l{„(4)>l/3}; 

using Lemma 3.2, the remaining part of the proof is careful analysis; see [26, Chapter 4]
for details. □


3.2. Proof of Lemma 3.3

It is sufficient to show:


f [log(l + ie— - log(l + ^ 0, 

where N{{), 1) and are independent of But since for any x, y > 0, 

log(l + x) - log(l + y) = (x - y)/(l + x) + r(x,y), where the reminder term 
< C{x — yY for some constant C, so all we need to show is as n — » oo: 




and 


,1 + 


k=l'-^ ' 






t 2 


0; 


(3.12) 


or equivalently, for any fixed $t$:




(3.13) 


Similar to the proof of Theorem 2.1, using the substitution $e^{z'} = (1/n) e^{\mu_n z + \mu_n^2/2}$, we then
rewrite:


El e 


jt[ (»/n)r'‘" . .. > 


J — oc L 


-1 


4> 


ii) <= 


(3.14) 


iuid 


J -0^^^ d2, (3.15) 


_ 1 r 

— f-hi 

f^n 


where on the right hand side, the expectation inside the integral sign is with respect 
to the law of Again by Lemma 3.2, the remaining part of the proof is careful 
analysis. See [26, Chapter 4] for the technical details. This concludes the proof of
Lemma 3.3. □ 


4. Proof of Theorem 1.2 

We prove Theorem 1.2 for the cases $r > \rho^*(s)$ and $0 < r < \rho^*(s)$ separately.

For the case $r > \rho^*(s)$, by the definition of $m^*$ and $m$, for $(s,r)$ in this range,
$m/m^* \to \infty$ as $n \to \infty$. First we consider the case under $H_0$; let:

$$a_n = \begin{cases} \sqrt{m/m^*}, & 0 < s < 1/3, \\ \sqrt{m/(2m^*)}, & s = 1/3, \\ \sqrt{m/m^*} \cdot \sqrt{-(\psi_s^0)''(0)}, & 1/3 < s < 1, \end{cases} \qquad b_n = \begin{cases} -m/(2m^*), & 0 < s < 1/3, \\ -m/(4m^*), & s = 1/3, \\ -i (\psi_s^0)'(0) \cdot m/m^*, & 1/3 < s < 1; \end{cases} \tag{4.1}$$

roughly speaking, $b_n$ is the mean value of $LR_{n,m}$, and $a_n$ is the standard deviation
of $LR_{n,m}$. By Theorem 1.1 and elementary analysis, it follows that
$[LR_{n,m} - b_n]/a_n \xrightarrow{w} N(0,1)$, and thus $LR_{n,m}/\sqrt{m/m^*} \to_p -\infty$ under $H_0$. A similar
argument shows $LR_{n,m}/\sqrt{m/m^*} \to_p \infty$ under $H_1^{(n,m)}$; this concludes the proof of
Theorem 1.2 in this case.

We now consider the case $r < \rho^*(s)$. First, we briefly explain why the proof is
non-trivial. Recall that $LR_{n,m}$ converges to 0 in probability, under the null as well
as under the alternative, which is a direct result of the studies of Sections 2-3;
however, this claim alone is not sufficient for proving Theorem 1.2 in this case: the
Kolmogorov-Smirnov distance between two random sequences could tend to 1 even
when both of them tend to 0 in probability; the culprit is the discontinuity of the
cdf of $\nu_0$ (here $\nu_0$ denotes the point mass with all mass at 0).

However, recall that given a cdf $F$ which is a continuous function, then for any
sequence of cdf's $F_n$ such that $F_n \xrightarrow{w} F$, we have:

$$\lim_{n \to \infty} \|F_n - F\|_{KS} = 0, \tag{4.2}$$

see, for example, [12]. Motivated by this, we need a stronger claim on the limiting
behavior of $LR_{n,m}$. Namely, for any fixed $(s,r)$ in this range, we hope to find a
sequence of numbers $\{\ell_n = \ell_{n,s,r}\}_{n=1}^{\infty}$ such that:

$$\ell_n \cdot LR_{n,m} \xrightarrow{w} F, \tag{4.3}$$

both under $H_0$ and under $H_1^{(n,m)}$, where $F$ is some continuous cdf.

This turns out to be true. Consider the following sub-regions of the undetectable
region $\{(s,r) : 0 < s < 1, \; 0 < r < \rho^*(s)\}$:

$\Omega_a$. $0 < s < 1/4$ and $0 < r < \rho^*(s)$, or $1/4 < s < 1/3$ and $4s - 1 < r < \rho^*(s)$.

$\Omega_b$. $1/4 < s < 1/3$ and $r = 4s - 1$.

$\Omega_c$. $1/3 < s < 1$ and $0 < r < \rho^*(s)$, or $1/4 < s < 1/3$ and $r < 4s - 1$.

The following theorem is proved in the Appendix:

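Theorem 4.1. For $\mu_n = \mu_{n,s} = \sqrt{2 s \log n}$, and

$$m = \begin{cases} n^r, & (s,r) \in \Omega_a \cup \Omega_b, \\ \sqrt{2\pi} \cdot \mu_n \cdot n^r, & (s,r) \in \Omega_c, \end{cases}$$

let $\ell_n = \ell_{n,\tau} = n^{\tau/2}$, where

$$\tau = \tau(s,r) = \begin{cases} 1 - 2s - r, & (s,r) \in \Omega_a \cup \Omega_b, \\ 2\big(1 + s - 2\sqrt{s(1+r)}\big), & (s,r) \in \Omega_c, \end{cases}$$

then under $H_0$ as well as under $H_1^{(n,m)}$:

$$\ell_n \cdot LR_{n,m} \xrightarrow{w} \begin{cases} N(0,1), & (s,r) \in \Omega_a, \\ N(0,1/2), & (s,r) \in \Omega_b, \\ \nu^0_{s,\tau}, & (s,r) \in \Omega_c, \end{cases}$$

where $\nu^0_{s,\tau}$ is the distribution with characteristic function $e^{\psi^0_{s,\tau}}$ and

$$\psi^0_{s,\tau}(t) = \int_{-\infty}^{\infty} \big(e^{it e^z} - 1 - it e^z\big) e^{-\frac{1 + s - \tau/2}{2s} z}\, dz.$$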

Adapting to our notations, Burnashev and Begmatov [8] have studied the limiting
behavior of $LR_{n,m}$ with $m = 1$.

We remark here that in Theorem 4.1, the log term in the calibration of $m$ is
chosen for convenience. A similar result will be true if we take $m = n^r$ without any
log term, and at the same time add some log term to $\ell_n$.
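The partition of the undetectable region and the exponent $\tau$ can be tabulated with a few lines of code. The sketch below is an illustration under assumptions, not the paper's own procedure: it takes $\rho^*(s) = 1 - 2s$ for $s \le 1/3$ and $(1-s)^2/(4s)$ otherwise, the three sub-regions as listed above (with the boundary points $s = 1/4$ and $s = 1/3$ assigned to the first and third region respectively), $\tau = 1 - 2s - r$ on the first two regions and $\tau = 2(1 + s - 2\sqrt{s(1+r)})$ on the third, and the rescaling $n^{\tau/2}$. All function names are hypothetical.

```python
import math


def rho_star(s):
    """Detection boundary of the Gaussian model (assumed closed form)."""
    return 1.0 - 2.0 * s if s <= 1.0 / 3.0 else (1.0 - s) ** 2 / (4.0 * s)


def classify_region(s, r):
    """Return 'a', 'b' or 'c' for a point of the undetectable region."""
    if not (0.0 < s < 1.0 and 0.0 < r < rho_star(s)):
        raise ValueError("(s, r) is not in the undetectable region")
    if s <= 0.25:
        return "a"
    if s < 1.0 / 3.0:
        edge = 4.0 * s - 1.0
        if math.isclose(r, edge, rel_tol=1e-9):
            return "b"
        return "a" if r > edge else "c"
    return "c"


def tau(s, r):
    """Exponent tau(s, r) used in the rescaling of the log-likelihood ratio."""
    if classify_region(s, r) in ("a", "b"):
        return 1.0 - 2.0 * s - r
    return 2.0 * (1.0 + s - 2.0 * math.sqrt(s * (1.0 + r)))


def rescaling(n, s, r):
    """Rescaling sequence n^{tau/2} (assumed form)."""
    return n ** (tau(s, r) / 2.0)
```

Under these assumptions the two expressions for $\tau$ agree on the line $r = 4s - 1$, and $\tau$ stays positive throughout the undetectable region, so the rescaling grows with $n$.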

We now finish the proof of Theorem 1.2 in this case. To do so, we first check
that $\nu^0_{s,\tau}$ indeed has a bounded continuous density function. In fact, by the substitution
$x = t e^z$, we can rewrite:

(4.4) 


where in ± the upper sign prevails for f > 0, and ^ is a complex number determined 
by: 

= .y [e“ - 1 - ix] • 

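with $\tau$ defined above and $(s,r) \in \Omega_c$, by elementary analysis,
$1 < (1 + s - \tau/2)/(2s) < 2$, and thus $\nu^0_{s,\tau}$ has a bounded continuous density function.

Now let $F_{s,r}$ be the cdf of $N(0,1)$, $N(0,1/2)$, and $\nu^0_{s,\tau}$ according to $(s,r) \in \Omega_a$,
$\Omega_b$, and $\Omega_c$; notice that $F_{s,r}$ is a continuous function. Now for any fixed $(r,s)$ in the
undetectable region, combining (4.2) with Theorem 4.1 gives, with $F_0^{(n,m)}$ and $F_1^{(n,m)}$
denoting the cdf's of $\ell_n \cdot LR_{n,m}$ under $H_0$ and $H_1^{(n,m)}$ (rescaling both by $\ell_n$ does not
change the Kolmogorov-Smirnov distance between them):

$$\lim_{n \to \infty} \big\|F_0^{(n,m)} - F_1^{(n,m)}\big\|_{KS} \le \lim_{n \to \infty} \Big[\big\|F_0^{(n,m)} - F_{s,r}\big\|_{KS} + \big\|F_1^{(n,m)} - F_{s,r}\big\|_{KS}\Big] = 0; \tag{4.5}$$

it then follows that, for any sequence of thresholds $\{t_n\}$, the thresholding
procedure that rejects $H_0$ when $LR_{n,m} \ge t_n$ has a sum of Type I and Type II errors
asymptotically equal to 1, uniformly over all sequences $\{t_n\}$:

$$\lim_{n \to \infty} \Big[P_{H_0}\{LR_{n,m} \ge t_n\} + P_{H_1^{(n,m)}}\{LR_{n,m} < t_n\}\Big] = 1.$$

Last, since for fixed $r$, $s$, and $n$, among all tests, the Neyman-Pearson likelihood
ratio test with a specific threshold has the smallest sum of Type I and Type II
errors, see, for example, [28], it then follows that the sum of Type I and Type II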
errors for any test tends to 1. This concludes the proof of Theorem 1.2 in this case. □

Remark. We now give a short remark about the distribution $\nu^0_{s,\tau}$. First, it was
pointed out in [15] that, for a characteristic function $e^{\psi}$ with $\psi$ in the form of
that in (4.4), the corresponding distribution has a finite $p$th moment if and only if
$p < (1 + s - \tau/2)/(2s)$; thus $\nu^0_{s,\tau}$ has a finite first moment, but not a finite second or
higher moment. Second, it would be interesting to study whether (or when) $\nu^0_{s,\tau}$ is
a stable law; it is a stable law if and only if the constant determined in (4.4) has
absolute value less than $2 - (1 + s - \tau/2)/(2s)$, see, for example, [15]; we skip
further discussion.

5. Proof of Theorem 1.4 

To prove Theorem 1.4, we note that it is sufficient to show

$$\lim_{n \to \infty} P_{H_1^{(n,m)}}\big\{HC_N^* < \sqrt{4 \log\log N}\big\} = 0. \tag{5.1}$$

The key for proving (5.1) is to argue that the distribution of $HC_N^*$ under $H_1^{(n,m)}$ will
be kept unchanged if we replace the original sampling procedure by the following
simple procedure: draw independently a total of $N$ samples, with the first $m$ from
$N(\mu_n, 1)$ and the remaining $N - m$ from $N(0,1)$; we refer to the latter as the simplified
sampling.

In fact, let $\widetilde{HC}_N^*$ denote the Higher Criticism statistic based on
samples obtained by the simplified sampling. To compare $\widetilde{HC}_N^*$ with $HC_N^*$, for any set of
integers $1 \le j_1, \ldots, j_m \le n$, let $A_{j_1, \ldots, j_m}$ be the event:

$$\{j_0(1) = j_1, \; \ldots, \; j_0(m) = j_m\};$$

by symmetry, conditional on $A_{j_1, \ldots, j_m}$, $HC_N^*$ equals $\widetilde{HC}_N^*$ in distribution:

$$\big(HC_N^* \,\big|\, A_{j_1, \ldots, j_m}\big) =_D \widetilde{HC}_N^*,$$

we thus conclude:

$$HC_N^* =_D \widetilde{HC}_N^*.$$

By the above analysis, it is clear that to show (5.1), it is sufficient to show:

$$\lim_{n \to \infty} P\big\{\widetilde{HC}_N^* < \sqrt{4 \log\log N}\big\} = 0, \tag{5.2}$$

where the probability is evaluated for samples obtained by the simplified sampling.
The proof of (5.2) is similar to the proof of Theorem 1.2 in [11], and we skip the
technical details. □

6. Extension 

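In this section, we extend our studies to certain non-Gaussian settings, or the
Generalized-Gaussian settings. The Generalized Gaussian (Subbotin) distribution
$GN_\gamma(\mu)$ has density function $\phi_\gamma(x - \mu)$, where

$$\phi_\gamma(x) = \frac{1}{C_\gamma} \exp\Big(-\frac{|x|^\gamma}{\gamma}\Big), \qquad C_\gamma = 2\, \Gamma(1/\gamma)\, \gamma^{1/\gamma - 1}. \tag{6.1}$$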
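As a sanity check on the normalizing constant, the density can be evaluated directly. This is a minimal sketch assuming the density is $\exp(-|x - \mu|^\gamma/\gamma)/C_\gamma$ with $C_\gamma = 2\Gamma(1/\gamma)\gamma^{1/\gamma - 1}$; the function names are illustrative only.

```python
import math


def subbotin_constant(gamma):
    """Normalizing constant C_gamma = 2 * Gamma(1/gamma) * gamma**(1/gamma - 1)."""
    return 2.0 * math.gamma(1.0 / gamma) * gamma ** (1.0 / gamma - 1.0)


def subbotin_density(x, gamma, mu=0.0):
    """Density of the Generalized Gaussian (Subbotin) law with location mu."""
    return math.exp(-abs(x - mu) ** gamma / gamma) / subbotin_constant(gamma)
```

With $\gamma = 2$ the constant is $\sqrt{2\pi}$ and the density is the standard normal one; with $\gamma = 1$ the constant is 2 and the density is the Laplace density $e^{-|x|}/2$.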

This class of distributions was introduced by M. T. Subbotin in 1923 ([31]) and has
been discussed in [27, p. 195]. The Gaussian is one member of this family: namely,
the one with $\gamma = 2$. The case $\gamma = 1$ corresponds to the Double Exponential
(Laplace) distribution, which is a well-understood and widely-used distribution.
The case $\gamma < 1$ is of interest in image analysis of natural scenes, where it has
been found that wavelet coefficients at a single scale can be modelled as following
a Subbotin distribution with $\gamma \approx 0.7$. This suggests that various problems of image
detection, such as in watermarking and steganography, could reasonably use the
model above. A direct extension of the Gaussian mixture model (2.2)-(2.3) is the
following:

$$H_0: \; X_j^{(k)} \stackrel{iid}{\sim} GN_\gamma(0), \qquad 1 \le j \le n, \; 1 \le k \le m, \tag{6.2}$$

$$H_1^{(n,m)}: \; X_j^{(k)} \stackrel{iid}{\sim} (1 - 1/n)\, GN_\gamma(0) + (1/n)\, GN_\gamma(\mu), \qquad 1 \le j \le n, \; 1 \le k \le m, \tag{6.3}$$

where we choose the calibrations in a similar way to that in the Gaussian setting:

$$\mu = \mu_{n,s,\gamma} = (\gamma s \log(n))^{1/\gamma}, \qquad m = n^r, \qquad 0 < s < 1, \; 0 < r < 1. \tag{6.4}$$

Similar to the Gaussian case, for $r$ and $s$ in this range, this is again a very subtle
problem. Recall that, as we mentioned in Section 1, the Gaussian mixture model
provides an important bridge for studying the (Gaussian) multiple-looks model,
and is also easier to study. For this reason, in this section, we will focus
on the extension of the Gaussian mixture model only. It would be interesting to work
on a direct extension of Model (1.4)-(1.5), or a non-Gaussian multiple-looks model;
heuristically, based on Theorems 6.1 and 6.2 below, parallel results for Theorems 1.2
and 1.4 should still hold if we replace the Gaussian noise setting by the
Generalized-Gaussian noise setting.

In this section, we will drop the subscript $\gamma$ whenever there is no confusion.

6.1. Log-likelihood ratio and limit law

In this section, in parallel to the Gaussian case, we discuss the limit law of the
log-likelihood ratio statistic. Let $g(z|\mu) = g(z|\mu,\gamma) = \phi_\gamma(z - \mu)/\phi_\gamma(z) = e^{(|z|^\gamma - |z - \mu|^\gamma)/\gamma}$; then the
log-likelihood ratio of testing Model (6.2)-(6.3) is
$\mathcal{LR}_{n,m} = \mathcal{LR}_{n,m,s,\gamma} = \sum_{k=1}^{m} \sum_{j=1}^{n} \mathcal{LR}_j^{(k)}$, where

$$\mathcal{LR}_j^{(k)} = \mathcal{LR}\big(\mu, n; X_j^{(k)}\big) = \log\big(1 - 1/n + (1/n)\, g(X_j^{(k)}|\mu,\gamma)\big). \tag{6.5}$$

We now discuss the cases $\gamma > 1$ and $0 < \gamma \le 1$ separately.

First for the case $\gamma > 1$. This case includes the Gaussian ($\gamma = 2$) as a special
case. Adapting to the notations in [26, Chapter 3], let

$$s_0(\gamma) = \big(2^{\frac{1}{\gamma-1}} - 1\big)^{\gamma} \big/ \big(2^{\frac{\gamma}{\gamma-1}} - 1\big),$$

$$a_1(\gamma) = \big[1 - (1/2)^{1/(\gamma-1)}\big]^{1-\gamma},$$

$$b_1(\gamma) = \big[1 - 2^{\frac{1}{\gamma-1}}\big] \big/ \big[1 - 2^{-\frac{1}{\gamma-1}}\big]^{\gamma-2},$$

and let $x_s = x_s(\gamma)$ be the unique solution of the equation

$$x^{\gamma} - (x - 1)^{\gamma} = \frac{1}{s}, \qquad x > 1;$$

notice here that $\gamma = 2$ corresponds to the Gaussian case: $a_1(2) = 2$, $b_1(2) = -1$, $s_0(2) = 1/3$,
and $x_s(2) = (1 + s)/(2s)$, which are the same as we derived before. The main
result for the case $\gamma > 1$ is the following theorem:

Theorem 6.1. For parameter $0 < s < 1$, let $\mu_n = \mu_{n,s,\gamma} = (\gamma \cdot s \cdot \log n)^{1/\gamma}$ and

m* = m*(n, s, 7) 

^ f (l/C,)-[2)T/((l-'r)6,(7))]'^^,,;r’'^.n‘-“'W', 0<s<so(7). 

\ Cy ■ f,l-' ■ s„(7)<s< 1, 

and CTZnjn* = J^'F.u,m',s,-f} then as n —* 00: 

1. When 0 < s < so{'y), 


1^, under Ha, 

CTln.m' nQ, 1^, wider H"’”* . 


2 . When s = so(7), 

CTZ,„m- N{-l/4, 1/2), tinder //o, 
Cn„,„,- ^ N{l/A, 1/2), under . 


3 . ly/ien .so(7) < s < 1 , 

CR„^„,‘ under Hq, CKn,m' under \ 

where u'‘l^ and u\ ^ are distributions with characteristic functions e^* "* and 
e'^’-.-r respectively, and vnth Ws,^ = a^^(7)/[ v.(x,(-y) ' -i)->-i “ 1)^ 


<->(0 


/ oo 

jg»iiog(i+e*) _ j _ 

•00 

/ oo 

JgUlog(l+e*) _ 1] 

•OO 




^■*d^. 


( 6 . 6 ) 

(6.7) 


In Section 6.3, we will discuss several issues about the laws $\nu^0_{s,\gamma}$ and $\nu^1_{s,\gamma}$; it was
validated in [26, Chapter 2] that both $\nu^0_{s,\gamma}$ and $\nu^1_{s,\gamma}$ are in fact infinitely divisible.

We now discuss the case $0 < \gamma \le 1$; this case includes the Laplace ($\gamma = 1$) as a
special case. The main result for this case is the following theorem:

Theorem 6.2. For 0 < 7 < 1 and 0 < s < 1, let 

tin = Pn,H,-r = (7slogn)-r, tu* = m*(n,s,7) = ^ (3/2) .„i-^ 


and CR-n^ni- = C.'R-n^m' then as n — + 00: 




CUn , 


m' 


-^,1^, under Ha, 
t ’ under . 


7 < 1, 
7=1, 

( 6 . 8 ) 


Theorems 6.1 and 6.2 are proved in [26, Chapter 3]. As $\gamma = 2$ corresponds to the
Gaussian case, the study in Section 2 is a special case of Theorem 6.1; however, in
comparison, technically we need much more subtle analysis to prove Theorem 6.1
than Theorem 2.1.

In this paper, we skip the proofs of Theorem 6.1 and Theorem 6.2.


6.2. Detection boundary

Similar to the Gaussian case, Theorem 6.1 implies that there is a threshold effect
for the detection problem of (6.2)-(6.3). Dropping some lower order terms when
necessary, $m^*$ reduces to a clean form: $m^* = n^{\rho^*_\gamma(s)}$, where

$$\rho^*_\gamma(s) = \begin{cases} 1 - s, & 0 < \gamma \le 1, \\ 1 - a_1(\gamma) \cdot s, & 0 < s \le s_0(\gamma), \; \gamma > 1, \\ s \cdot x_s(\gamma)^{\gamma} - 1, & s_0(\gamma) < s < 1, \; \gamma > 1. \end{cases}$$

Similarly, in the s-r plane, the curve $r = \rho^*_\gamma(s)$ separates the square
$\{(s,r) : 0 < s < 1, \; 0 < r < 1\}$ into two regions: a detectable region above the curve, and an
undetectable region below the curve; we call $r = \rho^*_\gamma(s)$ the detection boundary.
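The boundary for general $\gamma$ can be evaluated numerically. The sketch below is an illustration under explicit assumptions: $s_0(\gamma) = (2^{1/(\gamma-1)} - 1)^\gamma/(2^{\gamma/(\gamma-1)} - 1)$, $a_1(\gamma) = [1 - (1/2)^{1/(\gamma-1)}]^{1-\gamma}$, $x_s(\gamma)$ solving $x^\gamma - (x-1)^\gamma = 1/s$ on $x > 1$, the linear branches $1 - s$ (for $\gamma \le 1$) and $1 - a_1(\gamma) s$, and, for the curved branch, the expression $s \cdot x_s(\gamma)^\gamma - 1$. The curved-branch expression is an assumption of this sketch, chosen because it reduces to $(1-s)^2/(4s)$ at $\gamma = 2$ and joins the linear branch continuously at $s_0(\gamma)$. All function names are hypothetical.

```python
import math


def s0(gamma):
    """Break point between the linear and curved parts, for gamma > 1."""
    c = 2.0 ** (1.0 / (gamma - 1.0))
    return (c - 1.0) ** gamma / (c ** gamma - 1.0)


def a1(gamma):
    """Slope coefficient of the linear part, for gamma > 1."""
    return (1.0 - 0.5 ** (1.0 / (gamma - 1.0))) ** (1.0 - gamma)


def x_s(s, gamma):
    """Solve x**gamma - (x - 1)**gamma = 1/s for x > 1 by bisection (gamma > 1)."""
    def f(x):
        return x ** gamma - (x - 1.0) ** gamma - 1.0 / s

    lo, hi = 1.0, 2.0
    while f(hi) < 0.0:      # f is increasing on x > 1 and f(1) = 1 - 1/s < 0
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def rho_star_gamma(s, gamma):
    """Detection boundary exponent for the Generalized Gaussian mixture."""
    if gamma <= 1.0:
        return 1.0 - s
    if s <= s0(gamma):
        return 1.0 - a1(gamma) * s
    return s * x_s(s, gamma) ** gamma - 1.0  # curved branch (assumed form)
```

At $\gamma = 2$ this gives $s_0 = 1/3$, slope coefficient 2 and $x_s = (1+s)/(2s)$, recovering the Gaussian boundary. As $\gamma$ decreases to 1, $s_0(\gamma)$ tends to 1, which is consistent with the curved part vanishing in the limit.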

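Theorem 6.3. For $\gamma > 0$, let $\mu_n = \mu_{n,s,\gamma} = (\gamma \cdot s \cdot \log(n))^{1/\gamma}$, $m = n^r$, and
$\mathcal{LR}_{n,m} = \mathcal{LR}_{n,m,s,\gamma}$.

1. When $r > \rho^*_\gamma(s)$, consider the likelihood ratio test (LRT) that rejects $H_0$ when
$\mathcal{LR}_{n,m} > 0$; then the sum of Type I and Type II errors tends to 0:

$$P_{H_0}\{\text{Reject } H_0\} + P_{H_1^{(n,m)}}\{\text{Accept } H_0\} \to 0, \qquad n \to \infty.$$

2. When $r < \rho^*_\gamma(s)$,

$$\lim_{n \to \infty} \big\|F_0^{(n,m)} - F_1^{(n,m)}\big\|_{KS} = 0,$$

where $F_0^{(n,m)}$ and $F_1^{(n,m)}$ are the cdf's of $\mathcal{LR}_{n,m}$ under $H_0$ and $H_1^{(n,m)}$ respectively.
As a result, the sum of Type I and Type II errors for any test tends to 1:

$$P_{H_0}\{\text{Reject } H_0\} + P_{H_1^{(n,m)}}\{\text{Accept } H_0\} \to 1, \qquad n \to \infty.$$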

The proof of Theorem 6.3 is similar to that of Theorem 1.2, and we skip it. 

In [11], we have studied in detail the performance of the Higher Criticism statistic
for Model (6.2)-(6.3), and showed that Higher Criticism is also optimally adaptive
for Model (6.2)-(6.3) with any fixed $\gamma > 0$. It is interesting to notice that for
any fixed $\gamma > 1$, the detection boundary is partly linear ($0 < s \le s_0(\gamma)$) and
partly curved ($s_0(\gamma) < s < 1$). Again, this implies that the detection problem is
essentially different for parameters $(s, r)$ near the linear part and those near
the curved part. Asymptotically, when $(s, r)$ is close to the curved part, statistics
based on the few largest observations are able to detect effectively, while
when $(s, r)$ is close to the linear part, statistics based on the few largest observations
fail completely, and only the newly proposed Higher Criticism statistic, or the
Berk-Jones statistic, which is asymptotically equivalent to Higher Criticism in
some sense [5, 11], is still able to detect efficiently. See [11] for more discussion.

Moreover, notice that as $\gamma > 1$ approaches 1, the curved part of the detection
boundary continues to shrink and eventually vanishes, leaving only the linear part. So
when $0 < \gamma \le 1$, statistics based on the few largest observations fail completely
for all $0 < s < 1$. However, the Higher Criticism and Berk-Jones statistics
remain efficient.

In Figure 5, we plot $r = \rho^*_\gamma(s)$ for $\gamma = 3, 2, 1.5$, and $\gamma \le 1$. Notice that $\gamma = 2$
corresponds to the Gaussian case and $\rho^*_2 = \rho^*$.
Figure 2: Illustration of $w_{s,\gamma}$ as a function of $s$ for fixed $\gamma$. From left to right, the three
curves correspond to $w_{s,\gamma}$ over the intervals $[s_0(\gamma), 1]$ for $\gamma = 3, 2$ and $1.5$.


6.3. Remarks on the infinitely divisible laws 

In this section, we address several issues about the infinitely divisible laws $\nu^0_{s,\gamma}$
and $\nu^1_{s,\gamma}$.

The distribution of $\nu^0_{s,\gamma}$ or $\nu^1_{s,\gamma}$ is uniquely determined by the value of $w_{s,\gamma}$. By
elementary analysis, for fixed $\gamma$, when $s$ ranges between $s_0(\gamma)$ and 1, $w_{s,\gamma}$ strictly
decreases from 1 to 0. In Figure 2, we graph $w_{s,\gamma}$ as a function of $s$ with $\gamma = 1.5, 2, 3$.
Notice that $\gamma = 2$ corresponds to the Gaussian case, and


\[ w_{s,2} = (1 - s)/(2s). \]


As $0 < w_{s,\gamma} < 1$, it is easy to check that the characteristic functions of $\nu^0_{s,\gamma}$ and
$\nu^1_{s,\gamma}$ are absolutely integrable;
thus by the inversion formula (see [12], for example), both $\nu^0_{s,\gamma}$ and $\nu^1_{s,\gamma}$ have a bounded
continuous density function. In Figure 3, we graph the density functions of $\nu^0_{s,\gamma}$ and $\nu^1_{s,\gamma}$
with $w_{s,\gamma} = 0.4, 0.5, 0.6$ separately; recall that the density function is uniquely
determined by $w_{s,\gamma}$. Figure 3 suggests that, heuristically, the smaller $w_{s,\gamma}$ is, the
better the separation between $\nu^0_{s,\gamma}$ and $\nu^1_{s,\gamma}$; it would be interesting to validate this,
but we skip further discussion. Notice here that the density functions corresponding
to $w_{s,\gamma} = 0.5$ are the same as those in Figure 1, where $w_{s,\gamma} = 0.5$ since we take
$s = 1/2$, $\gamma = 2$.

Last, we claim that $\nu^0_{s,\gamma}$ has a finite first moment as well as a finite second
moment, and so does $\nu^1_{s,\gamma}$. In fact, elementary analysis shows that the second derivatives
of both characteristic functions exist, so the claim follows directly from the well-known
theorem that the existence of the second derivative of a characteristic function implies
the existence of the second moment; see [12, p. 104]. Moreover, the first
moments of $\nu^0_{s,\gamma}$ and $\nu^1_{s,\gamma}$ are:


\[ \int [\log(1 + e^z) - e^z]\, e^{-(1 + w_{s,\gamma}) z}\, dz, \qquad \int [(1 + e^z) \log(1 + e^z) - e^z]\, e^{-(1 + w_{s,\gamma}) z}\, dz, \]


and are negative and positive respectively; the second moments are:


\[ \int \log^2(1 + e^z)\, e^{-(1 + w_{s,\gamma}) z}\, dz, \qquad \int (1 + e^z) \log^2(1 + e^z)\, e^{-(1 + w_{s,\gamma}) z}\, dz. \]


It would be interesting to study whether higher order moments exist for $\nu^0_{s,\gamma}$
or $\nu^1_{s,\gamma}$. Here we skip further discussion.
Figure 3: Density functions for $\nu^0_{s,\gamma}$ and $\nu^1_{s,\gamma}$. The distributions of $\nu^0_{s,\gamma}$ and $\nu^1_{s,\gamma}$
depend only on $w_{s,\gamma}$. Left: from top to bottom, density functions for $\nu^0_{s,\gamma}$ with
$w_{s,\gamma} = 0.4, 0.5, 0.6$. Right: from bottom to top, density functions for $\nu^1_{s,\gamma}$ with
$w_{s,\gamma} = 0.4, 0.5, 0.6$.


7. Discussions 

7.1. Re-parametrization and detection boundary 

In Section 6, we calibrated the amplitude of the signal $\mu$ and the number of frames
$m$ through parameters $s$ and $r$ by:

\[ \mu_n = (\gamma s \log n)^{1/\gamma}, \qquad m = n^r, \qquad 0 < s < 1, \quad 0 < r < 1. \]

This particular calibration is very convenient for discussing the limit law of the
log-likelihood ratio: in order to make the log-likelihood ratio converge to a non-degenerate
distribution, the critical value $m = m^*$ may contain a log term, namely in the
case $s > s_0(\gamma)$. When we attempt to develop a different (but equivalent) calibration,
this log term may complicate the notation quite a bit. However, the above
calibration is not convenient for the discussion of the detection boundary. Recall
that the detection boundary for the Generalized-Gaussian Mixture model (6.2)-(6.3)
in the s-r plane is $r = \rho^*_\gamma(s)$, where:

(X 7 < 1, 

7 > 1 ; 

unfortunately, here Xg{-y) is the solution of x** — = 1/s, which doesn’t have 

an explicit formula. In addition to providing a completely explicit formula for the
detection boundary, the calibration we introduce below may also be more
familiar. As before, let $N = n \cdot m$ be the total number of observations, and let $\epsilon_N$ denote



0 < s < .so(7), 

Figure 4: Left panel: detection regions for Model (1.4)-(1.5) as well as the Gaussian
mixture model (2.2)-(2.3); the detection boundary separates the detectable region
(above) from the undetectable region (below). Right panel: detection regions in
the $\beta$-$\alpha$ plane under the re-parametrization in Section 7.1. The detection boundary
separates the detectable region from the undetectable region. The re-parametrization
maps the line segment $\{(s, r) : s = 1, 0 < r < 1\}$ in the left
panel to the line segment $\{\alpha = \beta : 1/2 < \beta < 1\}$, which separates the estimable
region (top) from the non-estimable region. When $(\alpha, \beta)$ falls into the estimable
region, it is possible not only to detect the presence of nonzero means, but also to
estimate those means.


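the fraction of observations containing a signal, so $m = N \cdot \epsilon_N$ and $n = 1/\epsilon_N$; we
now introduce parameters $(\beta, \alpha)$ and let:

\[ \epsilon_N = N^{-\beta}, \qquad \mu_N = (\gamma \alpha \log N)^{1/\gamma}; \]


this re-parametrization is equivalent to a simple transformation:


\[ \beta = 1/(1 + r), \quad \alpha = s/(1 + r), \quad 1/2 < \beta < 1, \quad 0 < \alpha < 1; \tag{7.1} \]


elementary algebra enables us to rewrite the detection boundary $r = \rho^*_\gamma(s)$ as:


\[
\alpha = \rho^*_\gamma(\beta) =
\begin{cases}
[2^{1/(\gamma - 1)} - 1]^{\gamma - 1} \cdot (\beta - 1/2), & 1/2 < \beta \le 1 - 2^{-\gamma/(\gamma - 1)}, \\
(1 - (1 - \beta)^{1/\gamma})^{\gamma}, & 1 - 2^{-\gamma/(\gamma - 1)} < \beta < 1.
\end{cases}
\]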


Figure 4 can help to understand the re-parametrization. In fact, the above
transform is a one-to-one mapping, which maps the square region in the s-r
plane $\{(s, r) : 0 < s < 1, 0 < r < 1\}$ (left panel) to the region in the $\beta$-$\alpha$
plane formed by cutting the triangular region off the top of the square
$\{(\beta, \alpha) : 0 < \alpha < 1, 0 < \beta < 1\}$ (right panel). Moreover, the new sub-regions
above/below the curve $\alpha = \rho^*(\beta)$ are the images of the detectable/undetectable
regions. See Figure 4 for more illustration. For Model (1.4)-(1.5), a problem closely
related to the detection problem we have discussed in this paper is the estimation
problem: with the same calibration, what is the critical value of $m$ such that the
signals can be reliably estimated? Surprisingly, though multiple looks are helpful for
detection, they are not at all helpful for estimation; in order that the signal be
estimable, we have to set the parameter $s > 1$, or $\mu > \sqrt{2 \log n}$; this range of $s$ is
not shown in the left panel of Figure 4. But by (7.1), $s > 1 \iff \alpha > \beta$, so in other
Figure 5: Left panel: Detection boundaries in the s-r plane for Model (6.2)-(6.3),
with $\gamma \le 1$, and $\gamma = 1.5, 2, 3$ from top to bottom. A small dot separates each curve
into two parts; the solid parts of the curves are line segments. Right panel: The same
detection boundaries in the $\beta$-$\alpha$ plane after the re-parametrization defined in (7.1).


words, in order that the signal be estimable, we need to pick $(\alpha, \beta)$ from the triangular
region at the top of the right panel in Figure 4; we call this triangular region
the estimable region. A similar problem was discussed in [1], with Model (2.2)-(2.3)
instead of Model (1.4)-(1.5).
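
The re-parametrization (7.1) and the boundary in the $\beta$-$\alpha$ plane are easy to evaluate numerically. The following Python sketch is only an illustration: the function names are ours, the boundary is coded for $\gamma > 1$ only, and the breakpoint $1 - 2^{-\gamma/(\gamma-1)}$ between the linear and the curved piece is an assumption, chosen as the value at which the two pieces agree.

```python
import math


def reparametrize(s, r):
    """Map (s, r) to (beta, alpha) via (7.1): beta = 1/(1+r), alpha = s/(1+r)."""
    beta = 1.0 / (1.0 + r)
    alpha = s / (1.0 + r)
    return beta, alpha


def inverse_reparametrize(beta, alpha):
    """Invert (7.1): s = alpha / beta, r = 1 / beta - 1."""
    return alpha / beta, 1.0 / beta - 1.0


def boundary_beta_alpha(beta, gamma):
    """Detection boundary alpha = rho*_gamma(beta) for gamma > 1, 1/2 < beta < 1.

    Assumption: the breakpoint beta_star is the point where the linear
    and the curved piece take the same value.
    """
    if gamma <= 1.0 or not (0.5 < beta < 1.0):
        raise ValueError("need gamma > 1 and 1/2 < beta < 1")
    beta_star = 1.0 - 2.0 ** (-gamma / (gamma - 1.0))
    if beta <= beta_star:
        slope = (2.0 ** (1.0 / (gamma - 1.0)) - 1.0) ** (gamma - 1.0)
        return slope * (beta - 0.5)
    return (1.0 - (1.0 - beta) ** (1.0 / gamma)) ** gamma
```

In the Gaussian case $\gamma = 2$ the sketch returns $\beta - 1/2$ on the linear part and $(1 - \sqrt{1 - \beta})^2$ on the curved part, with the two pieces meeting at $\beta = 3/4$.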

7.2. Discussions on Model (1.4)-(1.5) 

We now address several issues about the multiple-looks model, Model (1.4)-(1.5).

First, in astronomy, there is a Poisson version of the multiple-looks model. Although it
is of interest to study the Poisson model directly, the Gaussian model studied
in this paper is more convenient to analyze, and reveals insights
about the Poisson model.

Second, in Model (1.4)-(1.5), we have assumed that each $X_j^{(k)}$ has equal variance
whether it contains a signal or not. It is interesting to consider a more general case,
in which the pixels containing signals have equal variance $\sigma^2 > 1$,
while all other pixels have equal variance 1. Our study in this paper can be
generalized to this case easily, and the parameter $\sigma$ should have some scaling effect
on the detection boundary $r = \rho^*(s)$.

Last, it is interesting to study what happens if we relax some assumptions of
Model (1.4)-(1.5). For example, instead of assuming that exactly one pixel per frame
possibly contains a signal, we could consider a harder problem in which, in each frame,
more than one pixel possibly contains a signal with equal mean, while
the positions of such pixels are (independently or not) sampled from $\{1, 2, \ldots, n\}$,
but independently from frame to frame. Heuristically, if the number of pixels
containing a signal is relatively small, we should be able to show that this model
can also be converted approximately into a Gaussian mixture model by random
shuffling; notice that the study of the resulting Gaussian mixture model should be
similar to that in Section 2.

7.3. Relation to other work 

There are two points of contact with earlier literature. The first one is with Burnashev
and Begmatov [8], who studied the limit law of the log-likelihood ratio in a
setting which can be translated into ours with large $n$ but $m = 1$. They showed that,
for $n$ iid samples $z_i \sim N(0, 1)$, with appropriate normalization, the statistic they studied
converges weakly to a stable distribution as $n \to \infty$. It is interesting to notice here
that the non-Gaussian weak limits in Theorems 2.1 and 6.1 are infinitely divisible,
but not stable. It would be interesting to study whether the non-Gaussian limit in
Theorem 4.1 is stable or not.

The second point of contact is with the beautiful series of papers by Ingster
[19, 20] and [21]. Ingster studied extensively the Gaussian mixture model (2.2)-(2.3),
ranging from the limit law of the log-likelihood ratio to the minimax
estimation of signals lying in an $\ell_p$ ball. These papers revealed the same limiting
behavior of the log-likelihood ratio (and so the threshold effect) as discussed in Section 2.
Our approach in Section 2 was developed independently.

In this paper, our starting point was the multiple-looks model (1.4)-(1.5), which 
is different than the model studied by Ingster. We found that we could treat the 
multiple-looks model by proving that, after a re-expression of the problem, we 
obtained convergence in variation norm to the Gaussian mixture model (2.2)-(2.3), 
which we then analyzed. Hence, although we obtained eventually the same results 
as Ingster, our application and motivation were different. We think the alternative 
viewpoint adds something to the discussion. Moreover, the extension to the studies 
on generalized-Gaussian mixtures in Section 6 has not been studied before, and 
various effects of the parameter 7 are interesting. 

8. Appendix 

In this section, we will prove Theorem 4.1. Consider the following three sub-regions
of the square $\{(s, r) : 0 < s < 1, 0 < r < 1\}$:

$\omega_a$: $0 < s < 1/4$ and $0 < r < \rho^*(s)$, or $1/4 < s < 1/3$ and $0 < r < 2 - 6s$,
$\omega_b$: $1/4 < s < 1/3$ and $r = 2 - 6s$,

$\omega_c$: $1/3 < s < 1$ and $0 < r < 2(1 - \sqrt{s})^2$, or $1/4 < s < 1/3$ and $r > 2 - 6s$;

recall CTZj^^ = log(l — (1/n) -f- (1/n) • we have the following lemma: 

Lemma 8.1. If fin = ftn,8 = \/2s log n, = ^n,r = and with t = r(s, r) 

defined in Theorem 4.1, then when n 00, 



(<^ + 0(1)) 


1 - n-(2-2»)+r . 


2 


(s,r) € uia, 




(<2+o(l)) 


(s,r) 6 CJ6, 



(s, r) € uic, 


and 



/ 


1 _„-(l-2.0+r/2. 


/2 . (<^ + 0 ( 1 )) 


(s,r) 6 UJa, 


= { 1 _„-(»-2*)+r/2. 


(^^ + 0(1)) 


(s,r) € a»6, 



(s, r) € a;c, 
with $E_0$ and $E_1$ denoting the expectation with respect to the law of $z \sim N(0, 1)$ and
z ~ N(p„,l) respectively; here V’?,t(0 defined in Theorem 4.1, and = 



Proof. As the proofs of the two equations are similar, we only prove the first one.
Similarly to Section 2.1, namely (2.9)-(2.14):


= 1 + — j [e‘"-‘"*“+''‘>-l-iff„e*]<)>(2//i„)rf2 
+ 0(el/n^y, (8.1) 


by substitution e* — („ ■ e‘, we rewrite 


J ^gi(-MoK(l+e') dz 


(8.2) 

(8.3) 


Observe that $(1 + s - r/2)/(2s) > 1$ for $(s, r) \in \omega_a \cup \omega_b \cup \omega_c$, and moreover, according
to whether $(s, r)$ lies in $\omega_a$, $\omega_b$ or $\omega_c$, $(1 + s - r/2)/(2s) > 2$, $= 2$ or $< 2$; by similar arguments
as in the proof of Lemma 2.1, we derive:


j [^cr„ Mu*( !+.■/<-) _ n,z _ dz 

-[(t" + o(l))/2] • /z„ 

-[(t^ + o(l))/4] n„, 

(0+0(1)), 


(s, r) € Ida, 
(s,t) e oJh, 

(s, t) e idc\ 


inserting this back into (8.3), Lemma 8.1 follows. □

We now proceed to prove Theorem 4.1. With $\tau = \tau(s, r)$ as defined in Theorem 4.1,
observe that by the calibrations in Theorem 4.1, $(s, \tau) \in \omega_a \iff (s, r) \in \Omega_a$,
$(s, \tau) \in \omega_b \iff (s, r) \in \Omega_b$, and $(s, \tau) \in \omega_c \iff (s, r) \in \Omega_c$, so by Lemma 8.1 and
elementary analysis,


(n ■ CHr,.m = Y, 


fc=i Li*i 


Y (^" • 


ik). 


AT(0,1), (s,r)en„, 

AT(0,l/2), (s,r)enk, 

C^TI (s- f) e fir, 


under $H_0$ as well as under $H_1^{(n,m)}$; moreover, with $(s, r, \tau)$ in such range, we
argue in a similar way as in Section 3 that there is only a negligible
difference between $\widetilde{LR}_{n,m}$ and $LR_{n,m}$; combining these gives Theorem 4.1. □
Abstract: Studies of inhomogeneities in long DNA sequences can be insightful
for the organization of the human genome (or any genome). Questions about the
spacings of a marker array and general issues of sequence heterogeneity in our
studies of DNA and protein sequences led us to statistical considerations of r-scan
lengths, the distances between marker $i$ and marker $i + r$, $i = 1, 2, 3, \ldots$. It
is interesting to characterize the r-scan lengths harboring clusters or indicating
regions of over-dispersion of the markers along the sequence. Applications are
reviewed for certain words in the Haemophilus genome and the Cyanobacter
genome.

1. Introduction 

We are happy to contribute this paper to the Festschrift volume in honor of Dr. H. Rubin.
The paper is of practical and theoretical application. I also had the pleasure to
develop with Herman an extended analysis concerning a family of distributions in
possession of a monotone likelihood ratio ([1, 2]).

Questions about spacings of a marker array and general issues of sequence heterogeneity
in our studies of DNA and protein sequences led us to statistical considerations
of r-scan lengths, the distances between marker $i$ and marker $i + r$,
$i = 1, 2, 3, \ldots$. It is interesting to characterize the r-scan lengths harboring clusters or
indicating regions of over-dispersion of the markers along the sequence. Concretely,
a typical objective is to determine the probability of $r + 1$ successive markers
falling within a DNA sequence stretch under an appropriate stochastic model of
the marker array. There are similar issues pertaining to sparseness of markers. Particular
markers (in the language of DNA, e.g., specific restriction sites, nucleosome
placements, locations of genes) are distributed over the genome along chromosomes.
The r-scan analysis has been largely applied to homogeneous Poisson processes
for a marker array distributed over a long contig. It is known that the organization
of mammalian genomes shows substantial inhomogeneities, including "isochores",
regions dominated by either C + G or A + T DNA base content.

Here we consider an inhomogeneous Poisson process $\Pi$ on the real axis $(0, \infty)$
with an intensity $\lambda(s)$, $0 < s < \infty$. The intensity function $\lambda(s)$ can be of different
types, for example, periodic or constant in successive intervals, depending on
different applications. In this context, we would like to determine the asymptotic
distribution of the $k$-th minimum among the r-scan lengths over the interval horizon
$(0, t)$, as $t \to \infty$.

2. Preliminaries. Minimal r-scan lengths from a general distribution 

In the paper [3], the asymptotic distribution of the $k$th minimum r-scan length
from a general distribution function has been studied by applying the Chen-Stein

¹Department of Mathematics, Stanford University, Stanford, CA 94305-2125, USA. e-mail:
karlin@math.stanford.edu

Keywords and phrases: r-scan statistics, inhomogeneous Poisson marker array, asymptotic
distributions.
method [4]. In that context, an r-scan process is generated following a piecewise
constant function or continuous general density $f(x)$ with bounded support $(0, T]$.
Thus let $V_1, V_2, \ldots, V_{n-1}$ be $n - 1$ i.i.d. samples drawn from the density $f(x)$,
and let $V_1^* \le V_2^* \le \cdots \le V_{n-1}^*$ be the order statistics corresponding to $\{V_i\}$.

For convenience, let $V_0^* = 0$ and $V_n^* = T$. Then the associated r-scan fragments
$R_i = V^*_{i+r-1} - V^*_{i-1}$, $i = 1, \ldots, n - r + 1$, and their order statistics $R_i^*$ are defined
in the usual way such that $R_1^* \le R_2^* \le \cdots \le R^*_{n-r+1}$. For an extensive review of

r-scan statistics, see the book [5].
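
As an illustration of the definitions above, the r-scan fragments and their order statistics can be computed directly from the sample. The Python sketch below assumes the indexing $R_i = V^*_{i+r-1} - V^*_{i-1}$ with $V^*_0 = 0$ and $V^*_n = T$; the function names are ours and are not taken from [3] or [5].

```python
def r_scan_lengths(points, r, T):
    """Return R_1, ..., R_{n-r+1} for n - 1 sample points on (0, T].

    The sorted sample is padded with V*_0 = 0 and V*_n = T, and
    R_i = V*_{i+r-1} - V*_{i-1} (assumed indexing).
    """
    v = [0.0] + sorted(points) + [T]  # V*_0, V*_1, ..., V*_n
    n = len(v) - 1
    return [v[i + r - 1] - v[i - 1] for i in range(1, n - r + 2)]


def kth_minimum(lengths, k):
    """k-th smallest r-scan length, i.e. the order statistic R*_k."""
    return sorted(lengths)[k - 1]
```

For example, with markers at positions 2, 5 and 9 on $(0, 10]$, the 1-scan lengths are the four successive gaps 2, 3, 4, 1, and the 2-scan lengths are 5, 7, 5.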

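From $\{R_i\}$, we define the Bernoulli random variables

\[
U_i(a) =
\begin{cases}
1, & \text{if } R_i \le a, \\
0, & \text{if } R_i > a,
\end{cases}
\]

and their sum

\[ N_r(a) = \sum_{i=1}^{n-r+1} U_i(a). \]

Denote $m_{n,k} = R_k^*$. The asymptotic distribution (as $n \to \infty$) for $m_{n,k}$ is as
follows.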

Lemma 1. For a given positive constant $\mu$, let $a_n$ be determined to satisfy

\[ \mu = \frac{(n a_n)^r}{r!}\, n \int_0^T [f(x)]^{r+1}\, dx. \tag{1} \]

Then we have the Poisson approximation

\[ \lim_{n \to \infty} d(N_r(a_n), Po(\mu)) = 0, \]

for $Po(\mu)$ the Poisson distribution with parameter $\mu$. Here $d(\cdot, \cdot)$ is the total
variation distance between two random variables, defined by

\[ d(U, V) = \sup_A |\Pr\{U \in A\} - \Pr\{V \in A\}|. \]

Moreover, the $k$th minimal r-scan length, $m_{n,k}$, possesses the asymptotic distribution

\[ \lim_{n \to \infty} \Pr\{m_{n,k} > a_n\} = \sum_{i=0}^{k-1} \frac{\mu^i e^{-\mu}}{i!}. \tag{2} \]

The proof of the above lemma is given in [3], Section 8. Here, by adapting the foregoing
result, we will determine the asymptotic distribution of the $k$th minimal
r-scan length corresponding to an inhomogeneous Poisson process in $(0, t)$, as
$t \to \infty$.
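
The limit in (2) is the probability that a Poisson variable with mean $\mu$ is smaller than $k$, which is straightforward to evaluate. A minimal Python sketch follows; the function name is hypothetical.

```python
import math


def prob_kth_min_exceeds(mu, k):
    """Limiting Pr{m_{n,k} > a_n}: sum_{i=0}^{k-1} mu^i e^{-mu} / i!."""
    return sum(math.exp(-mu) * mu ** i / math.factorial(i) for i in range(k))
```

For $k = 1$ the limit reduces to $e^{-\mu}$, so the probability that the smallest r-scan length falls at or below $a_n$ is approximately $1 - e^{-\mu}$.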


3. Minimal r-scan lengths for an inhomogeneous Poisson process 

The asymptotic theorem for the minimal r-scan length will be derived from the
distributional property of $N_r(a)$, where $N_r(a)$ is the number of r-scan segments
of lengths $\le a$ over the interval horizon $(0, t)$. It is clear that if $N_r(a) < k$, the $k$th
minimal r-scan length $m_{t,k}$ exceeds the level $a$. Thus if the Poisson approximation
holds for $N_r(a)$, we can access the asymptotic law for $m_{t,k}$. Here the r-scan process
of interest is generated from an inhomogeneous Poisson process $\Pi$ with an intensity
function $\lambda(s)$, $0 < s < \infty$. The main theorem is as follows.
Theorem 1. Assume $\lambda(s)$ defined for $s > 0$ satisfies

\[ \int_0^t \lambda(s)\, ds \to \infty, \quad \text{as } t \to \infty. \tag{3} \]

For a given positive constant $\mu$, let $a_t$ be determined to satisfy the equation

\[ \mu = \frac{a_t^r}{r!} \int_0^t \lambda^{r+1}(s)\, ds. \tag{4} \]

Then we have the Poisson approximation

\[ \lim_{t \to \infty} d(N_r(a_t), Po(\mu)) = 0. \]

Moreover, the $k$th minimal r-scan length, $m_{t,k}$, possesses the asymptotic distribution

\[ \lim_{t \to \infty} \Pr\{m_{t,k} > a_t\} = \sum_{i=0}^{k-1} \frac{\mu^i e^{-\mu}}{i!}. \]

Proof of Theorem 1. If $n_t$ denotes the point count of the Poisson process $\Pi$ in $(0, t)$, then

\[ E[n_t] = \int_0^t \lambda(s)\, ds, \qquad \mathrm{Var}(n_t) = \int_0^t \lambda(s)\, ds. \]

For convenience, let $\tilde{n}_t = \lfloor E[n_t] \rfloor$. Thus the Berry-Esseen estimate assures




Therefore 


d(/Vr(a,),Po(p)} < d(N-_,^t(at).Po(,i))+0 


(/W) 


If $n_t = \tilde{n}_t$, the $\tilde{n}_t$ points in $(0, t)$ are distributed independently according to $g(x)$,
with

\[ g(x) = \frac{\lambda(x)}{\int_0^t \lambda(x)\, dx}, \qquad 0 < x < t. \tag{5} \]

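Following the result of Lemma 1, we have

\[ \lim_{t \to \infty} d(N_r^{(\tilde{n}_t)}(a_{\tilde{n}_t}), Po(\mu)) = 0 \]

for

\[ a_n = \left( \frac{r!\, \mu}{n^{r+1} \int_0^t [g(x)]^{r+1}\, dx} \right)^{1/r}. \tag{6} \]

Since

\[ \tilde{n}_t \approx \int_0^t \lambda(s)\, ds \to \infty, \]

we replace $n$ with $\tilde{n}_t$ in formula (6) and $g(x)$ with $\lambda(x) / \int_0^t \lambda(x)\, dx$ to verify equation (4).

On this basis, we obtain

\[ \lim_{t \to \infty} d(N_r(a_t), Po(\mu)) = 0, \]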

S. Karlin and C. Chen


with

\[ a_t = \left( \frac{r!\, \mu}{\int_0^t \lambda^{r+1}(s)\, ds} \right)^{1/r}. \]

This completes the proof of Theorem 1.
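
Equation (4) determines $a_t$ once the intensity is given: $a_t = (r!\, \mu / \int_0^t \lambda^{r+1}(s)\, ds)^{1/r}$. The following Python sketch evaluates it with a simple trapezoid rule; the numerical integration scheme, the step count and the function names are assumptions made for illustration, not part of the theorem.

```python
import math


def integral_lambda_power(intensity, t, r, steps=10_000):
    """Composite trapezoid approximation of int_0^t intensity(s)^(r+1) ds."""
    h = t / steps
    total = 0.5 * (intensity(0.0) ** (r + 1) + intensity(t) ** (r + 1))
    for j in range(1, steps):
        total += intensity(j * h) ** (r + 1)
    return total * h


def threshold_a_t(intensity, t, r, mu, steps=10_000):
    """Solve (4) for a_t: a_t = (r! * mu / int_0^t lambda^(r+1))^(1/r)."""
    integral = integral_lambda_power(intensity, t, r, steps)
    return (math.factorial(r) * mu / integral) ** (1.0 / r)
```

For a constant intensity $\lambda$ the integral is $\lambda^{r+1} t$, so with $\lambda = 2$, $t = 100$, $r = 2$ and $\mu = 1$ the threshold is $(2/800)^{1/2} = 0.05$.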


4. Examples 

Haemophilus influenzae is a bacterium which engenders an infection in the lungs
of humans [6]. The study of the USSs (uptake signal sequences) AAGTGCGGT
(USS+) and its inverted complement (USS-) in the H. influenzae genome (length
of 1.83 Mb, Rd strain) provides opportunities for characterizing global genomic
inhomogeneities. The homogeneous r-scan tests for $r = 1, 2, \ldots, 6$ show
significantly even spacings between the markers: the USS+ and USS- are
remarkably evenly spaced around the genome, such that both the USS+ positions and
the USS- positions have respective minimum spacings higher than expected by chance
with probability 0.001. This rare possibility may suggest that the homogeneity
assumption doesn't fit the real distributions of the markers and that an inhomogeneous
r-scan test should be applied for the marker array.

Another example is the distribution of the palindrome GGCGATCGCC, labeled
HIP1 (highly iterated palindrome), in the genome of the organism Synechocystis
(3.6 Mb). Synechocystis is thought to be the evolutionary precursor of vascular
plant plastids [7]. The photosynthetic endosymbiont became dependent on host
genetic information for maintenance and evolved into an organelle specialized for
CO2 fixation. The r-scan analysis of the genome shows in this case a significantly
even distribution. The observed minimal 1-scan spacing is 52 bp (base pairs), which
is much larger than the threshold of 9 bp with the probability of 0.001. Similar
conclusions apply to the r-scan tests of $r = 2, \ldots, 6$. The even spacing of HIP1 in
Synechocystis is more dramatic than the situation of USSs in H. influenzae.
Abstract: This paper considers the problem of estimating an unknown symmetric
region in $R^k$ based on $n$ points randomly drawn from it. The domain
of interest is characterized by two parameters: size parameter $r$ and shape
parameter $p$. Three methods are investigated: maximum likelihood,
Bayesian procedures, and a composition of these two. A modification of
Wald's theorem as well as a Bayesian version of it are given in this paper to
demonstrate the strong consistency of these estimates. We use the measures
of symmetric differences and the Hausdorff distance to assess the performance
of the estimates. The results reveal that the composite method does the best.
Discussion on the convergence in distribution is also given.


1. Introduction 

It is a pleasure to write this article for Professor Rubin's Festschrift. I cannot begin
to enumerate the things I have learned from him, and the number of times I walked
into his office or he walked into mine, drew up a chair, started a conversation,
and opened my eyes. This paper itself is a prime example of how much I benefitted
from him in my student days at Purdue.

In biology, the size and shape of the home range within a community of a species
of animal are often a starting point for the analysis of a social system. In forestry,
estimating the geographical edge of a rare species of plant based on sightings of
individuals is an important issue as well. The need to estimate an unknown domain
by using a set of points sampled randomly from it can also be seen in many other
disciplines. See Macdonald et al. (1979), Seber (1986, 1992), and Worton (1987).

If one considers the shape of the unknown domain an infinite-dimensional parameter,
the convex hull of the sample will be the maximum likelihood solution. Most
of the literature hence focuses on the study of the convex hull, and the results are
all for one-dimensional and two-dimensional regions. Refer to Ripley et al. (1977),
Moore (1984), and Braker et al. (1998).

However, if we use these results in some other applications, for example, recognizing
the valid region of predictor variables, which usually involves more than
two dimensions, we will then encounter some difficulties in implementation. As the
dimensionality rises above three, where a simple visual illustration
is impossible, describing the convex hull of a sample becomes much more
* National Cheng-Chi University, 64, Sec. 2, Zhi-nan Rd., Wenshan, Taipei 116, Taiwan,
Republic of China. e-mail: vctsai@nccu.edu.tw
difficult. Hence, a more practical approach for estimating a higher dimensional domain
is necessary. Due to this consideration, we would like to characterize the shape
of a domain by a finite-dimensional parameter rather than using a non-parametric
model, to which most of the literature is devoted. Besides, it is easier to establish
properties of estimates of the set of interest under parametric modelling. This would make
us more comfortable using these estimates.

Since the configuration of a roughly spherical object is more easily characterized, we
would like to start our investigation with a particular family of sets, the $\ell_p$ balls,
because of their richness in fitting roughly rounded objects and in deriving pilot
theoretical inference.

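Let $B_{p,r}$ denote the centered $\ell_p$ ball with radius $r$ with respect to the metric
induced from the p-norm in the k-dimensional Euclidean space, namely

\[ B_{p,r} = \{x \in R^k : \|x\|_p \le r\}, \tag{1} \]

here

\[
\|x\|_p =
\begin{cases}
(|x_1|^p + \cdots + |x_k|^p)^{1/p} & \text{when } p \text{ is finite}, \\
\max(|x_1|, \ldots, |x_k|) & \text{when } p \text{ is infinite}.
\end{cases}
\]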

We call $\|\cdot\|_p$ the p-norm operator. The unknown set $S$ we wish to estimate will be
assumed to be an $\ell_p$ ball; namely $S = B_{p_0, r_0}$ for some $0 < p_0 < \infty$ and $0 < r_0 < \infty$.

Notice that in our approach, the center of symmetry of the domain $S$ is assumed
to be known. This will not be exactly true in practice. A short discussion is given
in the last section.

Also notice that, when the dimension $k$ equals one, the family of $\ell_p$ balls becomes
the family of closed intervals $[-r, r]$ in the real line. Our one-dimensional
version of estimating an $\ell_p$ ball can be viewed as the well-known "end-point" problem:
estimating the end points $a$ and $b$ by using points randomly selected from $[a, b]$.
Also, $p$ does not play any role in characterizing the set $S$ which we wish to estimate
when $k = 1$. Therefore throughout this paper, we will take $k \ge 2$. However, the
one-dimensional case often lends much intuition to the case of higher dimensions.

Now let $x_1 = (x_{11}, \ldots, x_{1k})', \ldots, x_n = (x_{n1}, \ldots, x_{nk})'$ denote a realization of
$n$ points from the domain $S$. We would like to estimate $S$ by using these observations
$x_1, \ldots, x_n$. We will assume that $x_1, \ldots, x_n$ are independently and uniformly drawn
from $S$. It is possible in practice that $x_1, \ldots, x_n$ are independently drawn from $S$
not uniformly but following a measure $\mu$ on $R^k$ other than the Lebesgue measure,
truncated to $S$ with finite $\mu(S)$, i.e. $x_1, \ldots, x_n$ are iid from $\mu$ restricted to $S$ and
normalized by $\mu(S)$. There will be no problem
in deriving results similar to those we establish in this article if $\mu$ is known and
$B_{p,r}$ is identifiable under it. However, if $\mu$ is unknown, estimating $S$ becomes much
more difficult. The reason is that we will be unable to distinguish between a rare
event (e.g. the density with respect to $\mu$ at a point $x$ is small) and a null event (e.g.
the point $x$ is not in the support $S$); see Hall (1982).

To summarize, we have taken an interesting problem and analyzed an interesting
parametric model. We have given two very general results on strong consistency,
and additional results on weak convergence as well as practical evaluation by very
detailed numerics. We have indicated how to possibly address more general cases
and commented on application. These are the main contributions.


2. Estimation 

As the domain $S$ which we wish to estimate is characterized by parameters $p$ and $r$,
a plug-in method can be used to estimate $S$. We will consider three natural methods


Weak limits and practical performance of the ML estimate


of estimation for p and r: maximum likelihood method, a Bayesian approach, and 
a combination of these two methods. 

The maximum likelihood estimates have the drawback that they underestimate
the volume of the true set with probability one, and the magnitude of this bias is
difficult to evaluate. The Bayesian approach does not have this underestimating
problem. However, the Bayesian estimates are hard to calculate, which is not uncommon
in a Bayesian analysis. An alternative approach which combines the maximum likelihood
estimate and the Bayesian approach is therefore proposed. This approach treats the volume
of the true set as a parameter and estimates it using a Bayesian method. Then
it corrects the maximum likelihood estimates for their biases accordingly. We are
excited about this approach.

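Let us now look at the maximum likelihood method in detail first. Recall that
$x_1, \ldots, x_n$ are uniformly drawn from $S$. Thus the likelihood function of $p$ and $r$ is

\[ L(p, r \mid x_1, \ldots, x_n) = \left[ \frac{1}{\lambda(B_{p,r})} \right]^n 1_{\{(p, r)\, :\, x_i \in B_{p,r}\ \forall\, i = 1, \ldots, n\}}; \tag{2} \]


here $\lambda$ is the Lebesgue measure. The formula for the Lebesgue volume of $B_{p,r}$ is


\[ \lambda(B_{p,r}) = 2^k r^k\, \frac{\Gamma(1 + \frac{1}{p})^k}{\Gamma(1 + \frac{k}{p})} \tag{3} \]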


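(see Gradshteyn and Ryzhik (1994), p. 647). If we denote the maximum likelihood
estimate of $(p, r)$ by $(\hat{p}_{mle}, \hat{r}_{mle})$, then we have

\[ (\hat{p}_{mle}, \hat{r}_{mle}) = \arg\max_{(p, r)} L(p, r \mid x_1, \ldots, x_n) = \arg\min_{\{(p, r)\, :\, x_i \in B_{p,r}\ \forall\, i = 1, \ldots, n\}} \lambda(B_{p,r}). \tag{4} \]


Moreover, as $\lambda(B_{p,r})$ is an increasing function of $r$ for any fixed $p$, $(\hat{p}_{mle}, \hat{r}_{mle})$ must
satisfy

\[ \hat{r}_{mle} = \max_{1 \le i \le n} \|x_i\|_{\hat{p}_{mle}} \]

and hence

\[ \hat{p}_{mle} = \arg\max_p L\big(p, \max_{1 \le i \le n} \|x_i\|_p \mid x_1, \ldots, x_n\big). \]

The profile likelihood of $p$ mostly appears to be unimodal and therefore it is usually
not difficult to obtain $\hat{p}_{mle}$ and $\hat{r}_{mle}$ numerically.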
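
The characterization above reduces the maximum likelihood computation to a one-dimensional search over $p$: for each $p$, take $r$ to be the largest p-norm in the sample and keep the pair with the smallest volume (3). The Python sketch below does this over a user-supplied grid; the grid search is a stand-in chosen for illustration, and the function names are hypothetical.

```python
import math


def p_norm(x, p):
    """p-norm of a point x; p = math.inf gives the max-norm."""
    if math.isinf(p):
        return max(abs(c) for c in x)
    return sum(abs(c) ** p for c in x) ** (1.0 / p)


def ball_volume(p, r, k):
    """Lebesgue volume (3): (2r)^k * Gamma(1 + 1/p)^k / Gamma(1 + k/p)."""
    if math.isinf(p):
        return (2.0 * r) ** k
    return (2.0 * r) ** k * math.gamma(1.0 + 1.0 / p) ** k / math.gamma(1.0 + k / p)


def profile_mle(points, p_grid):
    """Return (p, r, volume) minimizing the volume over the grid of p values.

    For each p the radius is the largest p-norm among the points, which is
    the smallest radius whose ball contains the whole sample.
    """
    k = len(points[0])
    best = None
    for p in p_grid:
        r = max(p_norm(x, p) for x in points)
        vol = ball_volume(p, r, k)
        if best is None or vol < best[2]:
            best = (p, r, vol)
    return best
```

As a sanity check on (3), `ball_volume(2, 1, 2)` is the area $\pi$ of the unit disc and `ball_volume(1, 1, 2)` is the area 2 of the unit diamond.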

Despite this easy characterization of the maximum likelihood estimate, there
is a disadvantage in using this estimate. Consider the end-point problem. Suppose
$x_1, \ldots, x_n$ are iid Unif$([a, b])$. It is well known that the maximum likelihood set
estimate of $[a, b]$, $[x_{(1)}, x_{(n)}]$, is always contained in the true interval. Therefore
the length of the estimated support $[x_{(1)}, x_{(n)}]$ is always shorter than the true length
$b - a$. Similarly, when the dimension $k > 1$, the volume of the maximum likelihood
set estimate $B_{\hat{p}_{mle}, \hat{r}_{mle}}$ is always smaller than the true volume $\lambda(B_{p_0, r_0})$. The reason
is that the maximum likelihood set estimate is the $L_p$ ball which possesses
the smallest volume among $L_p$ balls containing all the observations. On the other
hand, the true domain evidently contains all the observations. Therefore, we have
$\lambda(B_{\hat{p}_{mle}, \hat{r}_{mle}}) \le \lambda(B_{p_0, r_0})$.

Here we would like to point out that unlike the end-point problem (or the
nonparametric setting), where the maximum likelihood interval estimate is always
contained in the true interval, the maximum likelihood set estimate $B_{\hat{p}_{mle}, \hat{r}_{mle}}$ does
not need to be inside the true set all the time.

Now let us move to the Bayesian approach. We will choose the loss function
to be

\[ l_1(S, \hat{S}) = \lambda(S \,\triangle\, \hat{S}) \tag{5} \]
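where $\triangle$ denotes the symmetric difference operator. If we denote the prior of $(p, r)$
by $\pi$, the posterior of $(p, r)$ after observing $x_1, \ldots, x_n$ is
$\pi(p, r \mid x_1, \ldots, x_n) \propto L(p, r \mid x_1, \ldots, x_n)\, \pi(p, r)$.
Thus the Bayesian estimate based on the loss function (5) is

\[ (\hat{p}_{bayes}, \hat{r}_{bayes}) = \arg\min_{(\hat{p}, \hat{r})} E_{\pi(p, r \mid x_1, \ldots, x_n)} \big( \lambda(B_{p,r} \,\triangle\, B_{\hat{p}, \hat{r}}) \big). \tag{6} \]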

Though we are able to show theoretically that $(\hat{p}_{bayes}, \hat{r}_{bayes})$ is strongly consistent
and does not have the underestimating problem that $(\hat{p}_{mle}, \hat{r}_{mle})$ does,
the computation of $(\hat{p}_{bayes}, \hat{r}_{bayes})$ is difficult. The reason is that we do not have
a formula for $\lambda(B_{p,r} \,\triangle\, B_{\hat{p}, \hat{r}})$ for two general $B_{p,r}$ and $B_{\hat{p}, \hat{r}}$ unless $B_{p,r} \subset B_{\hat{p}, \hat{r}}$
or $B_{\hat{p}, \hat{r}} \subset B_{p,r}$. So, in general, it seems we have to approximate the
Bayesian estimate numerically. This is a formidable numerical problem and indeed we are not
sure that a minimizer reported by the computer can be trusted.

Therefore an alternative approach is introduced to fix the drawback of the maximum
likelihood set estimates, which always underestimate the true volume, and the
disadvantage of the Bayesian estimates, which are computationally difficult. The
alternative approach tries to estimate the true volume using the Bayesian method,
and then corrects the maximum likelihood estimate for bias, based on the estimated
volume.

If we consider the loss function

\[ l_2(S, \hat{S}) = |\lambda(S) - \lambda(\hat{S})|, \tag{7} \]

it can be analyzed easily. One notes that it only gives a penalty for inaccuracy of
volume estimation. Therefore it provides us with only a decision on the volume
of $S$. The following proposition characterizes the class of all Bayesian estimates in
this situation.


Proposition 1. Let $x_1, \ldots, x_n$ be a random sample from $B_{p,r}$. Define the
transformation $v(p, r) = \lambda(B_{p,r})$ and denote a median of the posterior of $v(p,r)$ by $v_m$.
Then all the $L_p$ balls with volume $v_m$ are Bayesian estimates under the loss (7).

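Proof. Let us denote by $\pi(v \mid x_1, \ldots, x_n)$ the distribution of $v = v(p,r) = \lambda(B_{p,r})$
with $(p,r)$ having distribution $\pi(p, r \mid x_1, \ldots, x_n)$. The risk

$$\rho(\hat p,\hat r) = E_{\pi(p,r\mid x_1,\ldots,x_n)}\bigl(|\lambda(B_{p,r}) - \lambda(B_{\hat p,\hat r})|\bigr) = E_{\pi(v\mid x_1,\ldots,x_n)}\bigl(|v - v(\hat p,\hat r)|\bigr) \tag{8}$$

depends only on $v(\hat p, \hat r)$ and is minimized when $v(\hat p, \hat r)$ equals $v_m$. Namely,
$B_{\hat p,\hat r}$ is a Bayes estimate with respect to loss (7) for any $(\hat p, \hat r)$ for which
$\lambda(B_{\hat p,\hat r}) = v_m$. □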
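The key step of the proof, that a median minimizes expected absolute error, can be checked numerically. The sketch below uses a small list of hypothetical posterior draws of the volume $v$ (they are made-up numbers, not output of the paper's model) and compares the empirical expected loss at the median with the loss at every point of a grid of candidates.

```python
import statistics

def expected_abs_loss(draws, v_hat):
    """Empirical version of E|v - v_hat| over a list of posterior draws."""
    return sum(abs(v - v_hat) for v in draws) / len(draws)

# Hypothetical posterior draws of the volume v, for illustration only.
draws = [2.4, 2.6, 2.7, 2.9, 3.0, 3.1, 3.3, 3.6, 4.2]
v_m = statistics.median(draws)

# Brute-force search over candidate volumes; the minimizer should be the median.
candidates = [2.0 + 0.01 * i for i in range(251)]
best = min(candidates, key=lambda c: expected_abs_loss(draws, c))
```

With an odd number of draws the minimizer is unique; with an even number, any point between the two middle draws minimizes the loss, which is why the proposition speaks of "a median".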


As there are infinitely many $L_p$ balls with volume $v_m$, we need a criterion to
help us choose one among these as the estimate of $S$. A reasonable choice is the
$L_p$ ball whose pair $(p, r)$ has the smallest Euclidean distance from
$(\hat p_{mle},\hat r_{mle})$ among the infinitely many pairs implied in
Proposition 1. Thus, this composite approach is to find

$$(\hat p_{comb},\hat r_{comb}) = \arg\min_{\{(p,r)\,:\,\lambda(B_{p,r}) = v_m\}} \bigl[(p - \hat p_{mle})^2 + (r - \hat r_{mle})^2\bigr]. \tag{9}$$


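We characterize $(\hat p_{comb},\hat r_{comb})$ below. It is nice that the characterization
turned out to be as explicit as it is.

Proposition 2. Let $x_1, \ldots, x_n$ be a random sample from $B_{p,r}$. Then $(\hat p_{comb},\hat r_{comb})$
in (9) exists. Furthermore, $\hat p_{comb}$ is the unique root of

$$p^2 (p - \hat p_{mle}) - r(p)\bigl(r(p) - \hat r_{mle}\bigr)\Bigl(\psi\bigl(1 + \tfrac{k}{p}\bigr) - \psi\bigl(1 + \tfrac{1}{p}\bigr)\Bigr) = 0 \tag{10}$$

and $\hat r_{comb} = r(\hat p_{comb})$. Here $\psi$ is the digamma function and

$$r(p) = \frac{v_m^{1/k}\,\Gamma\bigl(1 + \tfrac{k}{p}\bigr)^{1/k}}{2\,\Gamma\bigl(1 + \tfrac{1}{p}\bigr)}. \tag{11}$$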


Proof. It is clear from (3) that for any fixed $0 < p < \infty$, $r(p)$ is the unique solution
in $r$ of $\lambda(B_{p,r}) = v_m$. If we can show that $(p - \hat p_{mle})^2 + (r(p) - \hat r_{mle})^2$ has a unique
minimum at some $p = \tilde p$, then $(\hat p_{comb},\hat r_{comb}) = (\tilde p, r(\tilde p))$.

This follows on observing that $\lambda(B_{\hat p_{mle},\hat r_{mle}}) = \hat v_{mle} < v_m$, which implies that the
point $(\hat p_{mle},\hat r_{mle})$ is under the curve $(p, r(p))$ in the $(p, r)$ plane. Furthermore, $r(p)$
is strictly convex and differentiable; therefore, we have the existence and uniqueness
of $\tilde p$, and it must satisfy

$$(\tilde p - \hat p_{mle}) + \bigl(r(\tilde p) - \hat r_{mle}\bigr)\, r'(\tilde p) = 0. \tag{12}$$

By some further calculations, we obtain
$r'(p) = r(p)\bigl(\psi(1 + \tfrac{1}{p}) - \psi(1 + \tfrac{k}{p})\bigr)/p^2$.
From (12) it now follows that $\tilde p$ is the unique root of (10). □
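As a numerical illustration of Proposition 2, the sketch below solves (10) by bisection and returns the pair $(\hat p_{comb}, \hat r_{comb})$. It is a sketch under stated assumptions rather than the authors' implementation: it assumes $k \ge 2$, a finite $\hat p_{mle}$, and $v_m$ larger than the volume of the maximum likelihood estimate, so that the left side of (10) is negative at $p = \hat p_{mle}$ and positive for large $p$. The digamma routine is a hand-written approximation (upward recurrence plus asymptotic series), and the inputs are hypothetical.

```python
import math

def digamma(x):
    """Digamma function psi(x) for x > 0: upward recurrence, then asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return acc + math.log(x) - 0.5 / x - series

def radius_for_volume(p, v, k):
    """r(p) of (11): radius at which the L_p ball in R^k has volume v."""
    log_r = (math.log(v) + math.lgamma(1.0 + k / p)) / k - math.lgamma(1.0 + 1.0 / p)
    return 0.5 * math.exp(log_r)

def combined_estimate(p_mle, r_mle, v_m, k, tol=1e-12):
    """Solve (10) for p by bisection on (p_mle, infinity); return (p_comb, r_comb)."""
    def g(p):
        r = radius_for_volume(p, v_m, k)
        psi_gap = digamma(1.0 + k / p) - digamma(1.0 + 1.0 / p)
        return p * p * (p - p_mle) - r * (r - r_mle) * psi_gap

    lo, hi = p_mle, 2.0 * p_mle + 1.0
    while g(hi) <= 0.0:           # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    p_comb = 0.5 * (lo + hi)
    return p_comb, radius_for_volume(p_comb, v_m, k)

# Hypothetical inputs: ML estimate (2, 0.9) and posterior median volume 3.1 in R^2.
p_comb, r_comb = combined_estimate(p_mle=2.0, r_mle=0.9, v_m=3.1, k=2)
```

Bisection is chosen because the proposition guarantees a unique root, so no derivative information or starting guess is needed.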


3. Strong consistency of the estimates 

Maximum likelihood and Bayesian estimation are the most widely used methods of
estimation and there is an enormous literature on them. However, much of the
well known asymptotic theory applies only to distributions satisfying certain
“regularity” conditions. See Lehmann and Casella (1998), Le Cam (1953), Huber
(1967), and Perlman (1972). One of the conditions requires that the distributions
have common support. Clearly, we cannot look for answers in these theories for
our problem, as the support is the parameter itself. Consequently, a more direct
approach is necessary, and the Wald theorem is its key ingredient.


3.1. Strong consistency of ML estimate

Let us consider the maximum likelihood estimate first. The most popular strong
consistency theorem for the maximum likelihood estimate is due to Wald (1949). It
can be applied to the non-regular case. In his paper, Wald gave several conditions
to prove a main theorem first. Then he established, essentially through this main
theorem, the strong consistency of the maximum likelihood estimate (in fact, of a
more general family of estimates) provided that the distributions satisfy those
conditions. Though our problem does not satisfy Wald’s conditions, the main theorem
nevertheless holds for our problem. Therefore, here we combine his main
theorem and his strong consistency theorem for our maximum likelihood estimate.
For completeness, we provide the proof.


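Theorem 1 (Wald). Let $P_\theta$ be a distribution with density $f(x;\theta)$, where $\theta \in \Theta$.
Suppose the realizations $x_1, \ldots, x_n$ come from $P_{\theta_0}$ independently for some $\theta_0 \in \Theta$.
Let $\hat\theta_n$ be a function of $x_1, \ldots, x_n$ satisfying

$$\frac{f(x_1;\hat\theta_n)\cdots f(x_n;\hat\theta_n)}{f(x_1;\theta_0)\cdots f(x_n;\theta_0)} \ge c > 0 \quad\text{for all } n \text{ and } x_1,\ldots,x_n, \text{ for some positive } c. \tag{13}$$

If for any given neighborhood of $\theta_0$, say $U$, it also holds that

$$P_{\theta_0}\Bigl\{\lim_{n\to\infty} \frac{\sup_{\theta\in\Theta\setminus U} f(x_1;\theta)\cdots f(x_n;\theta)}{f(x_1;\theta_0)\cdots f(x_n;\theta_0)} = 0\Bigr\} = 1, \tag{14}$$

then we have

$$P_{\theta_0}\Bigl\{\lim_{n\to\infty} \hat\theta_n = \theta_0\Bigr\} = 1. \tag{15}$$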


This theorem basically states that if the likelihood ratio of $\theta$ to $\theta_0$ is uniformly
small as $\theta$ falls outside any given neighborhood of the true parameter $\theta_0$, then the
estimate $\hat\theta_n$ must be close to $\theta_0$ since by assumption its likelihood ratio to $\theta_0$ is
always greater than or equal to $c$ (which is greater than 0).


Proof. This theorem does not require that the coordinates of $\theta_0$ are finite (note
that the shape parameter $p$ in our problem can be infinity). But we will give the
proof for $\theta_0$ having finite coordinates only, to avoid redundancy, since the proofs are
similar.

To prove (15), it suffices to show that for any neighborhood of $\theta_0$, say $U$, $\hat\theta_n$ will
fall inside $U$ eventually with probability one. But from (14), one sees that, with
probability one, there exists $N$, which may depend on $x_1, x_2, \ldots$, such that

$$\frac{\sup_{\theta\in\Theta\setminus U} f(x_1;\theta)\cdots f(x_n;\theta)}{f(x_1;\theta_0)\cdots f(x_n;\theta_0)} < \frac{c}{2} \quad\text{for all } n > N.$$

However, (13) claims that

$$\frac{f(x_1;\hat\theta_n)\cdots f(x_n;\hat\theta_n)}{f(x_1;\theta_0)\cdots f(x_n;\theta_0)} \ge c > \frac{c}{2} \quad\text{for all } n \text{ and } x_1,\ldots,x_n.$$

Thus, $\hat\theta_n \notin \Theta\setminus U$ when $n > N$. Therefore $\hat\theta_n$ belongs to $U$ eventually with
probability one, as claimed. □


Since a maximum likelihood estimate, if it exists, obviously satisfies (13) with
$c = 1$, this theorem also proves the strong consistency of the maximum
likelihood estimate provided (14) holds. Fortunately, our family of distributions
$\{\mathrm{Unif}(B_{p,r})\}_{\{0<p\le\infty,\ 0<r<\infty\}}$ satisfies (14).

Lemma 1. Let $P_\theta$ denote $\mathrm{Unif}(B_{p,r})$, where $\theta = (p,r)$ and
$\theta \in \Theta = \{(p, r) : 0 < p \le \infty,\ 0 < r < \infty\}$. Then $\{P_\theta\}_{\theta\in\Theta}$ satisfies (14).

Proof. The proof is extremely lengthy and involved. To maintain the flow of this
paper, we will only give a rough sketch here and refer to Tsai (2000) for the
rigorous proof.

The basic idea of this proof is as follows. For any given $(p, r) \ne (p_0, r_0)$, one has
either $B_{p_0,r_0} \subset_\lambda B_{p,r}$ or $\lambda(B_{p_0,r_0} \setminus B_{p,r}) > 0$; here $A \subset_\lambda B$ means $A$ is
contained in $B$ properly in the Lebesgue measure, i.e. $A \subset B$ and $\lambda(B \setminus A) > 0$. In the first
situation, the likelihood ratio equals $\bigl(\lambda(B_{p_0,r_0})/\lambda(B_{p,r})\bigr)^n$, which goes to 0 as $n$
goes to $\infty$ since $\lambda(B_{p_0,r_0}) < \lambda(B_{p,r})$. For the second case, we will, eventually, observe
some $x_i$ not belonging to $B_{p,r}$, which results in the zero value of the likelihood ratio.
As a result, (14) shall hold. □
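The two cases of the sketch can be mirrored in a few lines of code. The following is an illustration for a single fixed alternative $(p, r)$ with finite $p$; it does not address the uniformity over $\theta$ outside a neighborhood that (14) requires and that makes the rigorous proof lengthy. Function names and sample points are hypothetical.

```python
import math

def lp_ball_volume(p, r, k):
    """Lebesgue measure of the origin-centred L_p ball of radius r in R^k (finite p)."""
    log_vol = k * (math.log(2.0 * r) + math.lgamma(1.0 + 1.0 / p)) - math.lgamma(1.0 + k / p)
    return math.exp(log_vol)

def in_ball(x, p, r):
    """True if the point x lies in the origin-centred L_p ball of radius r."""
    return sum(abs(c) ** p for c in x) <= r ** p

def likelihood_ratio(points, p, r, p0, r0):
    """Likelihood of (p, r) divided by the likelihood of the true (p0, r0) under
    uniform sampling; assumes every observation lies in the true ball."""
    k = len(points[0])
    if not all(in_ball(x, p, r) for x in points):
        return 0.0   # second case: some observation falls outside the candidate ball
    n = len(points)
    # first case: ratio of volumes raised to the sample size
    return (lp_ball_volume(p0, r0, k) / lp_ball_volume(p, r, k)) ** n

# Hypothetical observations from the unit disc (p0 = 2, r0 = 1, k = 2).
points = [(0.1, 0.2), (-0.8, 0.0), (0.3, -0.4)]
ratio_larger_ball = likelihood_ratio(points, 2.0, 2.0, 2.0, 1.0)   # (pi / (4 pi))^3
ratio_smaller_ball = likelihood_ratio(points, 2.0, 0.5, 2.0, 1.0)  # a point lies outside
```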

Now by Theorem 1 and Lemma 1, we have the strong consistency of $(\hat p_{mle},\hat r_{mle})$.

Corollary 1. Let $x_1, \ldots, x_n$ be a random sample from $B_{p,r}$. Then the maximum
likelihood estimate $(\hat p_{mle},\hat r_{mle})$ is strongly consistent.

3.2. Strong consistency of Bayesian estimate

Let us now move to the consistency of the Bayesian estimate. The following is a
general result on the strong consistency of the Bayesian estimate under a general
assumption on the distribution family and the loss function. Basically, this theorem
and its proof are very similar to the Wald Theorem given in the previous section
except that we have to include the prior and the loss, which are the other elementary
components of Bayesian analysis. The generality of this theorem makes it an
attractive result of independent interest.

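Theorem 2. Suppose $P_\theta$ denotes a distribution with density $f(x;\theta)$, where $\theta \in \Theta$.
Assume the observations $x_1, \ldots, x_n$ are iid with probability $P_{\theta_0}$ for some $\theta_0 \in \Theta$.
Let $\pi(\theta)$ be a prior of $\theta$ and $l(\theta,\hat\theta)$ be a loss function such that

$$\int_\Theta \pi(\theta)\,d\theta < \infty \quad\text{and}\quad \int_\Theta l(\theta,\theta_0)\,\pi(\theta)\,d\theta < \infty. \tag{16}$$

Then the Bayesian estimate will converge to $\theta_0$ with probability one (under $P_{\theta_0}$)
provided that for any neighborhood of $\theta_0$, say $U$, there exist sets $W \subset V \subset U$
satisfying

(i) $P_{\theta_0}\Bigl\{\lim_{n\to\infty} \dfrac{\sup_{\theta\in V^c} f(x_1;\theta)\cdots f(x_n;\theta)}{\inf_{\theta\in W} f(x_1;\theta)\cdots f(x_n;\theta)} = 0\Bigr\} = 1$,

(ii) $\int_W \pi(\theta)\,d\theta > 0$, and

(iii) $\inf_{\hat\theta\in U^c,\ \theta\in V} \bigl[l(\theta,\hat\theta) - l(\theta,\theta_0)\bigr] \ge \epsilon$ for some $\epsilon > 0$.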

Remark 1. In this theorem, there is a condition on all components of the problem
(likelihood, prior, and loss). Condition (i) states that the likelihood ratio for $\theta$ far
away from $\theta_0$ versus $\theta$ near $\theta_0$ is uniformly small. Condition (ii) requires that the
prior puts a positive mass around the true $\theta_0$. Condition (iii) says that the loss
function does punish bad decisions. These conditions are all quite mild.


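Proof. We divide the proof into several steps for clarity and ease of understanding.
Step 1: Let us denote the posterior of $\theta$ given $x_1, \ldots, x_n$ by

$$\pi(\theta \mid x_1,\ldots,x_n) \propto \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta).$$

Then the posterior expected loss for decision $\hat\theta$ is
$\rho(\hat\theta) = E_{\pi(\theta\mid x_1,\ldots,x_n)}\bigl[l(\theta,\hat\theta)\bigr]$, and
the Bayesian estimate is $\hat\theta_{bayes} = \arg\min_{\hat\theta} \rho(\hat\theta)$.

To prove the strong consistency of $\hat\theta_{bayes}$, it suffices to show that for any
neighborhood of $\theta_0$, say $U$, $\hat\theta_{bayes}$ will fall inside $U$ eventually with probability one (under
$P_{\theta_0}$). Now, let $V$, $W$, and $\epsilon$ be as defined in conditions (i), (ii), and (iii). We will
show that

$$P_{\theta_0}\Bigl\{\inf_{\hat\theta\in U^c} \rho(\hat\theta) \ge \rho(\theta_0) + \frac{\epsilon}{4} \text{ eventually}\Bigr\} = 1. \tag{17}$$

This will imply

$$P_{\theta_0}\Bigl\{\arg\min_{\hat\theta} \rho(\hat\theta) \in U \text{ eventually}\Bigr\} = 1,$$

proving this theorem.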



W.-C. Tsai and A. DasGupta 


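Step 2: In this step, we will break $\rho(\hat\theta) - \rho(\theta_0)$ into several terms whose magnitudes
are easier to investigate. Note that

$$\begin{aligned}
\rho(\hat\theta) - \rho(\theta_0)
&= E_{\pi(\theta\mid x_1,\ldots,x_n)}\bigl(l(\theta,\hat\theta) - l(\theta,\theta_0)\bigr) \\
&= \frac{\int_\Theta \bigl(l(\theta,\hat\theta) - l(\theta,\theta_0)\bigr)\prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}{\int_\Theta \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta} \\
&= \frac{\int_V \bigl(l(\theta,\hat\theta) - l(\theta,\theta_0)\bigr)\prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}{\int_V \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}
 \cdot \frac{\int_V \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}{\int_\Theta \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta} \\
&\quad + \frac{\int_{V^c} l(\theta,\hat\theta)\prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}{\int_\Theta \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}
 - \frac{\int_{V^c} l(\theta,\theta_0)\prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}{\int_\Theta \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta} \\
&= (I)\cdot(II) + \bigl((III) - (IV)\bigr).
\end{aligned} \tag{18}$$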


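Step 3: In this step, we will show that (I) is always greater than or equal to $\epsilon$.
From condition (iii), it is easy to see that

$$(I) \ge \frac{\int_V \epsilon \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}{\int_V \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta} = \epsilon \quad\text{for all } \hat\theta \in U^c. \tag{19}$$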

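Step 4: Now, we claim

$$P_{\theta_0}\{(II) \to 1\} = 1. \tag{20}$$

Since $(II)^{-1} - 1$ equals the leftmost ratio below, it suffices to bound that ratio. Note that

$$\frac{\int_{V^c} \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}{\int_V \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}
\le \frac{\int_{V^c} \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}{\int_W \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta)\,d\theta}
\le \frac{\sup_{\theta\in V^c} f(x_1;\theta)\cdots f(x_n;\theta)\,\int_{V^c}\pi(\theta)\,d\theta}{\inf_{\theta\in W} f(x_1;\theta)\cdots f(x_n;\theta)\,\int_W \pi(\theta)\,d\theta}. \tag{21}$$

From condition (i), together with condition (ii) and (16), we get that the upper
bound (21) converges to 0 with probability one. Consequently, claim (20) is proved.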
Step 5: Now let us look at the term (III) − (IV). We would like to show that

$$P_{\theta_0}\Bigl\{\inf_{\hat\theta\in U^c} \{(III) - (IV)\} \ge -\frac{\epsilon}{4} \text{ eventually}\Bigr\} = 1. \tag{22}$$

Since (III) is nonnegative, we have (III) − (IV) ≥ −(IV), which does not depend
on $\hat\theta$. Moreover

$$(IV) \le \frac{\sup_{\theta\in V^c} f(x_1;\theta)\cdots f(x_n;\theta)\,\int_{V^c} l(\theta,\theta_0)\,\pi(\theta)\,d\theta}{\inf_{\theta\in W} f(x_1;\theta)\cdots f(x_n;\theta)\,\int_W \pi(\theta)\,d\theta}.$$

Again from conditions (i) and (ii), and (16), we get that (IV) converges to 0 with
probability one. Therefore (22) is true.

Step 6: Finally, as an immediate consequence of (18), (19), (20), and (22) together,
we obtain (17). The theorem therefore follows. □


Now we would like to apply Theorem 2 to our problem. The following lemma
says that the distribution family $\mathrm{Unif}(B_{p,r})$ and the loss function
$l((p,r),(\hat p,\hat r)) = \lambda(B_{p,r} \,\Delta\, B_{\hat p,\hat r})$ satisfy conditions (i) and (iii) of Theorem 2.

Lemma 2. Let $P_\theta$ denote the distribution $\mathrm{Unif}(B_{p,r})$, where $\theta = (p, r)$ and
$\theta \in \Theta = \{(p,r) : 0 < p \le \infty,\ 0 < r < \infty\}$. Let
$l((p,r),(\hat p,\hat r)) = \lambda(B_{p,r} \,\Delta\, B_{\hat p,\hat r})$ be the
loss function and let $\pi$ be the prior on $\theta$. Suppose $\theta_0 = (p_0,r_0)$ is a fixed point in
$\Theta$ and $\pi$ is positive in some neighborhood of $(p_0, r_0)$. Then for any neighborhood
of $(p_0,r_0)$, say $U$, there exist sets $W \subset V \subset U$ such that the conditions (i), (ii),
and (iii) in Theorem 2 hold.

Proof. The idea of the proof is not difficult. However, the proof is very lengthy.
Refer to Tsai (2000). □

Now, as an application of Theorem 2, we have the strong consistency of
$(\hat p_{bayes},\hat r_{bayes})$ as follows:

Corollary 2. Let $x_1, \ldots, x_n$ be iid with distribution $\mathrm{Unif}(B_{p,r})$. Suppose the true
value of $(p,r)$ is denoted by $(p_0,r_0)$. Let $\pi$ be a proper prior on $(p,r)$ such that $\pi$
is positive in some neighborhood of $(p_0,r_0)$. Assume also that $E_{\pi(p,r)}(\lambda(B_{p,r}))$ is
finite. Then the Bayesian estimate under the loss
$l((p,r),(\hat p,\hat r)) = \lambda(B_{p,r} \,\Delta\, B_{\hat p,\hat r})$
converges to $(p_0,r_0)$ with probability one.

Proof. From the assumption on $\pi$, one has

$$E_{\pi(p,r)}\bigl[\lambda(B_{p,r} \,\Delta\, B_{p_0,r_0})\bigr] \le E_{\pi(p,r)}\bigl[\lambda(B_{p,r}) + \lambda(B_{p_0,r_0})\bigr] < \infty.$$

Thus, the corollary follows from Theorem 2 and Lemma 2 immediately. □


3.3. Strong consistency of combined estimate 

Now we discuss the strong consistency of the combined estimate $(\hat p_{comb},\hat r_{comb})$. Recall
that it is the pair $(p,r)$ closest to the initial guess $(\hat p_{mle},\hat r_{mle})$ with $\lambda(B_{p,r})$ equal to
$v_m$, the posterior median of $v = \lambda(B_{p,r})$. From Corollary 1 and Corollary 3 below,
$(\hat p_{mle},\hat r_{mle})$ and $v_m$ are both strongly consistent in the respective parameters. One
may expect, therefore, that the combined estimate will be strongly consistent as
well. We give a general theorem in this direction below. Again the generality makes
it an appealing theorem of independent interest.


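Theorem 3. Let $x_1, \ldots, x_n$ be a sample from a distribution $P_\theta$, $\theta \in \Theta$. Let $\Theta$ be
a metric space with a metric $d$. Let $\hat\theta_n$ and $\hat\beta_n$ be functions of the observations
$x_1, \ldots, x_n$ such that $\hat\theta_n$ and $\hat\beta_n$ converge almost surely to $\theta$ and $\beta(\theta)$, respectively,
under $P_\theta$, where $\beta(\theta)$ is a function of $\theta$. Suppose
$\tilde\theta_n = \arg\min_{\{\theta' :\, \beta(\theta') = \hat\beta_n\}} d(\hat\theta_n,\theta')$
exists and is unique. Then $\tilde\theta_n$ converges to $\theta$ with probability one if for any $\epsilon > 0$,
there exists a neighborhood of $\beta(\theta)$ contained in $\beta(B_d(\theta,\epsilon))$, where $B_d(\theta,\epsilon)$ is the
$\epsilon$-ball centered at $\theta$ with respect to the metric $d$.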


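Proof. To prove the strong consistency of $\tilde\theta_n$ it is enough to show that for any
$\epsilon > 0$,

$$P_\theta\bigl\{d(\theta,\tilde\theta_n) < 3\epsilon \text{ eventually}\bigr\} = 1. \tag{23}$$

By assumption, there exists a neighborhood of $\beta(\theta)$, say $B$, contained in $\beta(B_d(\theta,\epsilon))$;
so, if $\hat\beta_n \in B$, there exists $\theta^*$ within $\epsilon$ distance of $\theta$ such that $\beta(\theta^*) = \hat\beta_n$. Then, one
has

$$d(\hat\theta_n,\tilde\theta_n) = \min_{\{\theta' :\, \beta(\theta') = \hat\beta_n\}} d(\hat\theta_n,\theta') \le d(\hat\theta_n,\theta^*) \le d(\hat\theta_n,\theta) + d(\theta,\theta^*) < d(\theta,\hat\theta_n) + \epsilon,$$

which implies

$$d(\theta,\tilde\theta_n) \le d(\theta,\hat\theta_n) + d(\hat\theta_n,\tilde\theta_n) < 2d(\theta,\hat\theta_n) + \epsilon.$$

Furthermore, if $d(\theta,\hat\theta_n) < \epsilon$ then we have $d(\theta,\tilde\theta_n) < 3\epsilon$. On the other hand, $\hat\theta_n$ and
$\hat\beta_n$ are strongly consistent for $\theta$ and $\beta(\theta)$ respectively. This implies

$$P_\theta\bigl\{d(\theta,\hat\theta_n) < \epsilon \text{ and } \hat\beta_n \in B \text{ eventually}\bigr\} = 1.$$

This proves (23) and hence the theorem. □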

To apply the above general theorem to our problem, we need the strong consistency
of $v_m$. This will be implied by the following theorem, which generalizes
Theorem 2.


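Theorem 4. Let $x_1, \ldots, x_n$ be a sample from $P_\theta$ with density $f(x;\theta)$, where $\theta \in \Theta$.
Suppose we are interested in estimating a function $\beta(\theta)$ (rather than $\theta$ itself) and
the loss is a function of $\theta$ through $\beta(\theta)$, say $l(\beta(\theta),\hat\beta)$. Denote the true value of $\theta$ by
$\theta_0$ and the prior of $\theta$ by $\pi$. Assume $\int \pi(\theta)\,d\theta < \infty$ and
$\int l(\beta(\theta),\beta(\theta_0))\,\pi(\theta)\,d\theta < \infty$.
Then the Bayesian estimate of $\beta(\theta)$,
$\arg\min_{\hat\beta} E_{\pi(\theta\mid x_1,\ldots,x_n)}\bigl(l(\beta(\theta),\hat\beta)\bigr)$, converges
to $\beta_0 = \beta(\theta_0)$ with probability one under $P_{\theta_0}$ provided that for any neighborhood of
$\beta_0$, say $B$, there exist sets $W \subset V \subset \beta^{-1}(B)$ satisfying

(i) $P_{\theta_0}\Bigl\{\lim_{n\to\infty} \dfrac{\sup_{\theta\in\Theta\setminus V} f(x_1;\theta)\cdots f(x_n;\theta)}{\inf_{\theta\in W} f(x_1;\theta)\cdots f(x_n;\theta)} = 0\Bigr\} = 1$,

(ii) $\int_W \pi(\theta)\,d\theta > 0$, and

(iii) $\inf_{\hat\beta\in B^c,\ \theta\in V} \bigl[l(\beta(\theta),\hat\beta) - l(\beta(\theta),\beta(\theta_0))\bigr] \ge \epsilon$ for some $\epsilon > 0$.

Remark 2. Theorem 2 is a special case of Theorem 4 when we take $\beta(\theta) = \theta$.
Moreover, in this theorem, $\beta$ does not have to be one-to-one and $\beta^{-1}(B)$ is defined
as $\{\theta : \beta(\theta) \in B\}$.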

Proof. The proof is exactly the same as that of Theorem 2. □ 

We now apply Theorem 4 to prove the strong consistency of $v_m$.


Corollary 3. Let $x_1, \ldots, x_n$ be a random sample from $B_{p,r}$. Define $v = v(p,r) =
\lambda(B_{p,r})$. Let $\pi$ be a prior on $(p, r)$ and $v_m$ the posterior median of $v$. Let also $(p_0, r_0)$
denote the true value of $(p,r)$. If $\pi$ is positive in a neighborhood of $(p_0,r_0)$, then
$v_m$ converges to $v(p_0,r_0)$ with probability one.

Proof. Denote $v(p_0,r_0)$ by $v_0$. Let $B$ be a neighborhood of $v_0$. Without loss of
generality, we can assume $B = (v_0 - \delta, v_0 + \delta)$ for some $\delta > 0$. Since $v(p, r)$ is a
continuous function of $(p,r)$, there exists a neighborhood of $(p_0,r_0)$, say $U$, such
that $U \subset v^{-1}(B')$, where $B' = (v_0 - \tfrac{\delta}{3}, v_0 + \tfrac{\delta}{3})$. Then by Lemma 2, there exist sets
$W \subset V \subset U$ such that conditions (i) and (ii) in Theorem 4 hold.

Furthermore, if $(p, r) \in V$, one has $v(p, r) \in B'$, which implies

$$|v(p, r) - v(p_0,r_0)| < \frac{\delta}{3} \quad\text{and}\quad |v(p, r) - \hat v| > \frac{2\delta}{3} \quad\text{for all } \hat v \in B^c.$$

This gives us condition (iii) of Theorem 4.

The corollary, therefore, follows from Theorem 4 immediately. □

Corollary 3 endows us with the strong consistency of $v_m$ needed to apply the
general result of Theorem 3. We are now ready to prove the strong consistency of
$(\hat p_{comb},\hat r_{comb})$.

Corollary 4. The estimator $(\hat p_{comb},\hat r_{comb})$ defined in Section 2.3 is strongly
consistent.

Proof. To prove this corollary, we will apply Theorem 3 for the case when the
true value, $p_0$, of $p$ is finite. When $p_0$ is infinity, we will prove the corollary
directly. Recall that

$$(\hat p_{comb},\hat r_{comb}) = \arg\min_{\{(p,r)\,:\,\lambda(B_{p,r}) = v_m\}} \bigl[(p - \hat p_{mle})^2 + (r - \hat r_{mle})^2\bigr].$$

Also Corollary 1 and Corollary 3 give us the strong consistency of $(\hat p_{mle},\hat r_{mle})$ and
$v_m$ respectively.

Case 1, $p_0 = \infty$: By (10) and the fact that $v_m > \lambda(B_{\hat p_{mle},\hat r_{mle}})$, $\hat p_{comb}$ must be
greater than $\hat p_{mle}$. As $\hat p_{mle}$ converges to $p_0 = \infty$ with probability one, so does
$\hat p_{comb}$. Furthermore, Proposition 2 also gives
$\hat r_{comb} = v_m^{1/k}\,\Gamma(1 + k/\hat p_{comb})^{1/k} / \bigl(2\,\Gamma(1 + 1/\hat p_{comb})\bigr)$. Thus the
strong consistency of $\hat r_{comb}$ follows from the strong consistency of $\hat p_{comb}$ and $v_m$
immediately.

Case 2, $p_0 < \infty$: We will prove this case as an application of Theorem 3. For
any given $\epsilon > 0$, let us take $B = \bigl(\lambda(B_{p_0,(r_0-\epsilon)^+}),\ \lambda(B_{p_0,r_0+\epsilon})\bigr)$. It is easy to see
that for any $b \in B$, there exists $(r_0 - \epsilon)^+ < r < r_0 + \epsilon$ such that $v(p_0,r) = b$,
and certainly the distance between $(p_0, r)$ and $(p_0, r_0)$ is smaller than $\epsilon$. Therefore,
the assumptions in Theorem 3 are all satisfied. The corollary for the case when
$p_0 < \infty$ follows. □


4. Discussion 

This section first compares the performance of the maximum likelihood estimate
with that of the combined estimate, especially when the sample size is small. Recall that
the Bayes estimate is difficult to compute. Then, some simulations and conjectures on
the asymptotic distribution of the estimates are given because, unfortunately, exact
results are very hard to obtain. We end with a brief discussion of the case when the
center of symmetry of the true set is unknown.


4.1. Comparison of $(\hat p_{mle},\hat r_{mle})$ and $(\hat p_{comb},\hat r_{comb})$

We remarked that the combined estimate can be principally considered as a dilation
of the maximum likelihood estimate. Our simulation will try to examine: (i) in what
fashion the combined estimate dilates the maximum likelihood estimate, (ii) if it
indeed helps with regard to underestimation of the volume of the true set, and
(iii) if the choice of the prior on p and r affects the performance of the combined
estimate.

The tables and figures referenced below are based on a simulation of size 750 with
true (p, r) = (2, 1), dimension k = 2, and sample size n = 10. We consider three re- 
spective priors on (p.r). They are n\(p,r) = j)e~^re~’' , 7r2(p,r) = ^j?e~^re~'~ , and 
’t 3 (p, r) = respectively. We denote each of the corresponding combined 

estimates by $(\hat p_{comb1},\hat r_{comb1})$, $(\hat p_{comb2},\hat r_{comb2})$, and $(\hat p_{comb3},\hat r_{comb3})$, respectively.

Table 1 gives the mean and the standard error of the volume of the maximum
likelihood estimate and of the three combined estimates, and of their symmetric
difference distances as well as their Hausdorff distances to the true set. This table shows
that the volumes of the combined estimates are much closer to the true volume (which
is $\pi = 3.14159$), but with a higher variance, than that of the maximum likelihood estimate.
Moreover, the distances, either one, of the combined estimates to the true set are about 20%
to 30% less compared to the maximum likelihood estimate. It also appears that the
selection of the prior does not affect the performance of the combined estimate very
much.

Table 1: The mean and standard error (in parentheses) from a size 750 simulation
of the volume of the maximum likelihood estimate and the combined estimates with
respect to three different priors on (p, r) and the symmetric difference distances and
the Hausdorff distances to the true set.

true (p, r) = (2, 1), k = 2, n = 10

Estimate                                 Volume               Symmetric difference   Hausdorff distance
                                                              to the true set        to the true set
$(\hat p_{mle},\hat r_{mle})$            2.70519 (0.294328)   0.466593 (0.290792)    0.118659 (0.077068)
$(\hat p_{comb1},\hat r_{comb1})$        3.10145 (0.321972)   0.336655 (0.200423)    0.095912 (0.065939)
$(\hat p_{comb2},\hat r_{comb2})$        3.10321 (0.332331)   0.342807 (0.201751)    0.097780 (0.066906)
$(\hat p_{comb3},\hat r_{comb3})$        3.13368 (0.344959)   0.351907 (0.202675)    0.098770 (0.065463)

Figure 1: Scatter plots of $(\hat p_{mle},\hat p_{comb})$ and $(\hat r_{mle},\hat r_{comb})$.

Figure 1 plots $\hat p_{comb1}$ against $\hat p_{mle}$ and $\hat r_{comb1}$ against $\hat r_{mle}$. We see that the
scatter plot of $(\hat p_{mle},\hat p_{comb1})$ is virtually the 45 degree line; the points
$(\hat r_{mle},\hat r_{comb1})$, on the other hand, all fall above the 45 degree line. We have similar
results for the other two combined estimators. So, the combined estimate may indeed be
considered as if it was dilated from the maximum likelihood estimate by enlarging only
the radius $r$ while keeping $p$ essentially fixed at $\hat p_{mle}$. This is interesting.

4.2. Convergence in distribution 

In this section, some simulations and conjectures on the asymptotic distribution of
the maximum likelihood estimate will be given. Figure 2 shows several scatter plots
of $(n(\hat p_{mle} - p),\, n(\hat r_{mle} - r))$ with $p = 2$, $r = 1$, and various sample sizes. We believe
that when the true value of $p$ is finite, $(n(\hat p_{mle} - p),\, n(\hat r_{mle} - r))$ converges to some
nondegenerate distribution which puts all its mass in the half plane
$\{(x, y) : y < \tfrac{r}{p^2}\bigl(\psi(1 + \tfrac{1}{p}) - \psi(1 + \tfrac{k}{p})\bigr)\,x\}$. It is also obvious that the
correlation of $\hat p_{mle}$ and $\hat r_{mle}$ is negative. When $\hat p_{mle}$ overestimates the true $p$, the
corresponding $\hat r_{mle}$ will then likely underestimate the true $r$, and vice versa.

Weak limits and practical performance of the ML estimate

Figure 2: Scatter plots of $(n(\hat p_{mle} - p),\, n(\hat r_{mle} - r))$ with $(p, r) = (2, 1)$ and $k = 2$.
Solid line is $(n(t - p),\, n(r(t)/r - 1)\,r)$, where $t$ ranges from 0 to $\infty$ and $r(t)$ is the
radius at which $B_{t,r(t)}$ has the true volume.
Broken line is the straight line through the origin with slope
$\tfrac{r}{p^2}\bigl(\psi(1 + \tfrac{1}{p}) - \psi(1 + \tfrac{k}{p})\bigr)$.

Figure 3 gives scatter plots of $(\sqrt{n}/\hat p_{mle},\, n(\hat r_{mle} - r))$ for the case where $(p, r) =
(\infty, 1)$. It seems that $(\sqrt{n}/\hat p_{mle},\, n(\hat r_{mle} - r))$ converges to some nondegenerate
distribution having support in the fourth quadrant. Interestingly, the convergence rates
seem dependent on the true value of $p$.

In fact, these conjectures were inspired by the case when one of the parameters
($p$ or $r$) is known. A summary of the behavior of $\hat p_{mle}$ when $r$ is assumed to be
known is given below. Similar results can also be derived for the case when $p$ is
assumed to be known. See Tsai (2000) for details.

If we assume $r$ is known, say, $r = r_0$, the characterization of the maximum
likelihood estimate of $p$ becomes very simple. We are in fact able to give the exact
distribution of $\hat p_{mle}$ and therefore the weak convergence result for $\hat p_{mle}$. The idea of
getting this result is very simple. Indeed this problem can be converted to an end-point
problem if we consider the new random variables $z_i = \lambda(B_{p_{r_0}(x_i),r_0})$, where
$B_{p_{r_0}(x_i),r_0}$ is the smallest $L_p$ ball containing $x_i$ with radius $r_0$. It can be easily shown
that the $z_i$'s are independently and identically distributed with values between 0 and
the volume of the true domain, and that $\lambda(B_{\hat p_{mle},r_0}) = \max_{1\le i\le n} z_i$, whose asymptotic
distribution is well known. Thus we have the following weak convergence result for
$\hat p_{mle}$ when the true $r$ is known.
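The reduction to an end-point problem can be spelled out in a short derivation. This is a sketch under the assumptions that $k \ge 2$ and that, for fixed $r_0$, the balls $B_{p',r_0}$ are nested and their volume is continuous and strictly increasing in $p'$; the shorthand $v = \lambda(B_{p,r_0})$ is introduced here for convenience and is not notation from the paper.

```latex
% For 0 <= z <= v, let p'(z) be the shape parameter with \lambda(B_{p'(z), r_0}) = z.
% Then z_i <= z exactly when x_i lies in B_{p'(z), r_0}, so under uniform sampling
\begin{align*}
P(z_i \le z) &= P\bigl(x_i \in B_{p'(z),\, r_0}\bigr)
              = \frac{\lambda(B_{p'(z),\, r_0})}{\lambda(B_{p,\, r_0})}
              = \frac{z}{v}, \qquad 0 \le z \le v ,
\end{align*}
% i.e. z_i is uniform on (0, v).  The maximum of n iid uniforms then satisfies
\begin{align*}
P\Bigl( n\,\frac{v - \max_{1 \le i \le n} z_i}{v} > t \Bigr)
   &= \Bigl(1 - \frac{t}{n}\Bigr)^{n} \longrightarrow e^{-t}, \qquad t \ge 0 ,
\end{align*}
% so that n ( \max_i z_i - v ) converges in distribution to -v G with G exponential, mean 1.
```

Applying the delta method to the map from volume back to $p$ then transfers this limit to $\hat p_{mle}$.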

Proposition 3. Suppose $x_1, \ldots, x_n$ are iid from $\mathrm{Unif}(B_{p,r_0})$, where $0 < r_0 < \infty$ is
known. Let $G$ denote an exponential random variable with mean 1. Then
(I) when $p < \infty$,

$$n(\hat p_{mle} - p) \xrightarrow{D} -\frac{p^2}{k\bigl(\psi(1 + \tfrac{k}{p}) - \psi(1 + \tfrac{1}{p})\bigr)}\, G, \tag{24}$$

where $\psi$ is the digamma function, and
(II) when $p = \infty$,

$$\sqrt{n}\,\frac{1}{\hat p_{mle}} \xrightarrow{D} \sqrt{\frac{12}{\pi^2 k(k-1)}}\,\sqrt{G}. \tag{25}$$

Remark 3. Note that when $p < \infty$, interestingly, the asymptotic variance,
$p^4 / \bigl[k\bigl(\psi(1 + \tfrac{k}{p}) - \psi(1 + \tfrac{1}{p})\bigr)\bigr]^2$, is a decreasing function of the dimension $k$.
It appears that the curse of dimensionality does not show up in this problem. To the
contrary, for estimation of the single shape parameter $p$, it is beneficial to have a large $k$.

Remark 4. In fact, $n\bigl(\lambda(B_{\hat p_{mle},r_0}) - \lambda(B_{p,r_0})\bigr) \xrightarrow{D} -\lambda(B_{p,r_0})\,G$. If we divide both
sides by the true volume, this expression tells us that the proportion of the uncommon
part between the estimate and the true set (to the true set) converges at
rate $n$ to an exponential distribution. It does not relate to the true set. The
convergence rates of $\hat p_{mle}$, however, do depend on $p$. It is interesting that the speed
of convergence of $\hat p_{mle}$ slows down from $n$ to $\sqrt{n}$ discontinuously as $p$ changes from
finite to infinite. We believe that this phenomenon is caused by the difficulty of
“catching the corners” of a square, for example. This is also interesting.


4.3. Unknown center of symmetry

In practice, the center of symmetry of the object usually would not be known. It 
then has to be estimated. In this section, we will have a brief examination of this 
situation.

Figure 3: Scatter plots of $(\sqrt{n}/\hat p_{mle},\, n(\hat r_{mle} - r))$ with $(p, r) = (\infty, 1)$ and $k = 2$ for
different sample sizes (panels include n = 25 and n = 50).

Figure 4: Visual display of the set estimate when the center is unknown. The region
bounded by the solid curves is the true set; the region bounded by the broken or the
dotted curve is the maximum likelihood estimate with the center assumed to be known or
estimated by the mean of the observations, respectively. The conspicuous circle is the
estimated center and the dots are the observations.

Apparently, it is not easy to estimate the center together with the shape parameter
$p$ and the size parameter $r$ by using the maximum likelihood method. See
Amey et al. (1991) for some calculations. Besides, the problem of underestimation
of the volume by the maximum likelihood estimate in this situation will be more
serious. Therefore, it may be preferable to estimate the center by some other external
methods. We tried the mean of the observations, and the $L_2$ median (spatial
median), which minimizes $\sum_{1\le i\le n} \|x_i - u\|_2$ over $u$. It turns out that the mean
of the observations performs better than the $L_2$ median. Therefore, here we
attempt to check how the estimate may be influenced if the center is unknown and
is estimated by the mean of the observations. Figure 4 gives a visual comparison
between the maximum likelihood estimates with the center treated as known and
with the center estimated by the mean of the observations. It can be seen that the
shape of the estimates can vary very much depending on whether the center is
known or estimated. But the estimate of the size parameter does not differ that
much. Moreover, the volume of the maximum likelihood estimate with the center
estimated by the mean of the observations can exceed the volume of the true set.
When the realizations cluster to one side with some observations appearing in the
far opposite direction, the estimate can miss the true set badly. Therefore,
constructing a better estimate for the center of symmetry of the true set is
important.

Abstract: The contact process, and more generally interacting particle
systems, are useful and interesting models for a variety of statistical problems.

This paper is a report on past, present and future of research by the authors
concerning the problem of estimating the parameters of the contact process.

A brief review of published work on an ad-hoc estimator for the case where
the process is observed at a single (large) time t is given in Section 1. In
Section 2 we discuss maximum likelihood estimation for the case where the
process is observed during a long time interval [0, t]. We construct the
estimator and state its asymptotic properties as t → ∞, but spare the reader
the long and tedious proof. In Section 3 we return to the case where the
process is observed at a single time t and obtain the likelihood equation for
the estimator. Much work remains to be done to find a workable
approximation to the estimator and study its properties. Our prime interest is to
find out whether it is significantly better than the ad-hoc estimator in
Section 1.

It was a joy to write this paper for Herman Rubin’s festschrift. To this
is added the bonus that Herman will doubtless solve our remaining problems
immediately.


1. Introduction 


The contact process was introduced and first studied by Harris (1974). It is
described as follows. At every time $t \ge 0$, every point (or site) $x$ in the $d$-dimensional
integer lattice $\mathbb{Z}^d$ is in one of two possible states that we shall call infected and
healthy. The process starts at time $t = 0$ with a non-empty set $A \subset \mathbb{Z}^d$ of infected
sites. At time $t \ge 0$, the state of the site $x \in \mathbb{Z}^d$ will be indicated by a random
variable $\xi_t^A(x)$, given by

$$\xi_t^A(x) = \begin{cases} 1, & \text{if site } x \text{ is infected at time } t, \\ 0, & \text{if site } x \text{ is healthy at time } t. \end{cases} \tag{1.1}$$

The function $\xi_t^A : \mathbb{Z}^d \to \{0,1\}$ describes the state of the process at time $t$ and
$\xi_0^A = 1_A$, the indicator function of the set $A$.

The evolution of this $\{0, 1\}$-valued random field is described by the following
dynamics. A healthy site is infected independently and at rate $\lambda > 0$ by each
of its $2d$ immediate neighbors that is itself infected. An infected site recovers at
rate $\mu > 0$. Given the configuration $\xi_t^A$ at time $t$, the processes involved are
independent until a change occurs. For $d = 2$ the contact process is a simplified
model for the spread of an infection or, more generally, of a biological species in
the plane. The growth of a forest is an example if diseased and healthy are
interpreted as presence and absence of a tree in a square centered at the lattice
site.

Figure 1: The process for $\lambda = 3$ and $\mu = 1$ after 30,000 steps. Infected sites are
represented by gray 1×1 squares. A darker gray level indicates a longer duration
of the present infection.
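The dynamics just described can be simulated event by event. The following is a minimal sketch for $d = 2$, started from a single infected site at the origin; it uses a standard continuous-time (Gillespie-type) scheme, which is an assumption of this illustration and is not described in the text. The function name, the seed, and the data layout are hypothetical.

```python
import random

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def simulate_contact_process(lam, mu, steps, seed=0):
    """Run at most `steps` events (infections and recoveries) of the contact
    process on Z^2 started from the origin.
    Returns (set of infected sites, elapsed time, number of events)."""
    rng = random.Random(seed)
    infected = {(0, 0)}
    t, events = 0.0, 0
    while events < steps and infected:
        # One entry per (infected site, healthy neighbour) pair, so a healthy site
        # with m infected neighbours appears m times and is infected at rate m * lam.
        targets = [(x + dx, y + dy)
                   for (x, y) in infected
                   for (dx, dy) in NEIGHBOURS
                   if (x + dx, y + dy) not in infected]
        infect_rate = lam * len(targets)
        recover_rate = mu * len(infected)
        total = infect_rate + recover_rate
        t += rng.expovariate(total)            # exponential waiting time to the next event
        if rng.random() * total < infect_rate:
            infected.add(rng.choice(targets))
        else:
            infected.remove(rng.choice(sorted(infected)))
        events += 1
    return infected, t, events

infected, t, events = simulate_contact_process(lam=3.0, mu=1.0, steps=300, seed=1)
```

Rebuilding the list of infection targets at every event costs time proportional to the number of infected sites; this keeps the sketch short but would be too slow for runs of the size shown in Figure 1, where an incrementally maintained boundary would be preferable.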

In Figure 1 we show the process that started with a single infected site at
the origin with $\lambda = 3$ and $\mu = 1$ after 30,000 steps, i.e. 30,000 infections and
recoveries. Infected sites are indicated by gray 1×1 squares. An additional feature
of this figure is that for each infected site we have kept track of the number of
steps since it was last infected and have indicated this by the gray level at that site:
the darker the gray level, the older the present infection at a site. If we view the
process as a model for the growth of a forest, then the gray level indicates the age
of the tree. Obviously, the older trees are in the center of the picture, away from the
boundary.

It is sometimes convenient to represent the state of the contact process at time
$t$ by the set of infected sites rather than by the function $\xi_t^A : \mathbb{Z}^d \to \{0, 1\}$. Usually,
this set is also denoted by $\xi_t^A$. Thus, by an abuse of notation, we write

$$\xi_t^A = \{x \in \mathbb{Z}^d : \xi_t^A(x) = 1\}. \tag{1.2}$$

Let

$$\tau^A = \inf\{t : \xi_t^A = \emptyset\} \tag{1.3}$$

denote the time the infection dies out, with the convention that $\tau^A = \infty$ if the
infection survives forever. For a set $C \subset \mathbb{R}^d$ and $a > 0$, we write $aC = \{ax : x \in C\}$.
For sets $C$ and $D$ in $\mathbb{R}^d$, $C \oplus D = \{x + y : x \in C,\ y \in D\}$ will denote their Minkowski
sum and we define

$$H_t^A = \bigcup_{0\le s\le t} \xi_s^A \oplus \bigl[-\tfrac12, \tfrac12\bigr]^d. \tag{1.4}$$

Thus $H_t^A$ is obtained from the set of sites that have been infected up to and including
time $t$ by replacing each site by a hypercube with sides of length 1 centered at this
site.

The contact process has been the subject of extensive studies during the past 
decades. We list a few of its basic properties. 

Property 1. If $\rho = \lambda/\mu$ exceeds a certain critical value $\rho_d$, then the infection will
continue forever (i.e. $\tau^A = \infty$) with positive probability depending on the dimension
$d$ and the initial set $A$. This is called the supercritical case. On the other hand, if
$\rho < \rho_d$, then the infection will eventually die out (i.e. $\tau^A < \infty$) with probability 1.
We shall restrict attention to the supercritical case.

Property 2. In the supercritical case, there exist positive constants $C$ and $\gamma$ such
that for every $t > 0$ and $A \subset \mathbb{Z}^d$ with cardinality $|A|$,

$$P(t < \tau^A < \infty) \le C e^{-\gamma t}, \qquad P(\tau^A < \infty) \le C e^{-\gamma |A|}. \qquad (1.5)$$

In particular, if $A$ is infinite, then in the supercritical case the infection will survive
forever.

Property 3. The distribution of the set $\xi_t^A$ converges weakly to a limit distribution

$$P(\tau^A < \infty)\,\delta_\emptyset + P(\tau^A = \infty)\,\nu, \qquad (1.6)$$

where $\delta_\emptyset$ denotes the measure that assigns probability 1 to the empty set and $\nu$
is the equilibrium measure depending only on $\rho$ and the dimension $d$. Thus, given
that the process survives forever (which is possible only in the supercritical case),
it tends in distribution to $\nu$. Here weak convergence coincides with convergence in
distribution of the finite dimensional projections $\xi_t^A \cap F$ (i.e. $\{\xi_t^A(x) : x \in F\}$) for
finite $F \subset \mathbb{Z}^d$.

Property 4. There exists a bounded convex set $U \subset \mathbb{R}^d$ with the origin as an
interior point such that for every bounded $A \subset \mathbb{Z}^d$, $\varepsilon > 0$ and $t \to \infty$,

$$(1 - \varepsilon)tU \subset H_t^A \subset (1 + \varepsilon)tU, \qquad (1.7)$$

eventually almost surely on the set $\{\tau^A = \infty\}$ where $\xi_t^A$ survives forever. Thus if
the infection persists, then for large $t$, $H_t^A$ will grow linearly in $t$ in every direction
and $t^{-1}H_t^A$ will assume the shape of $U$. Moreover, on the set $\{\tau^A = \infty\}$ and for
large $t$, the distribution of $\xi_t^A \cap (1 - \varepsilon)tU$ will approach its asymptotic distribution
under the equilibrium measure $\nu$ in a sense that we shall not make precise
here.

For these facts and other related matters the reader may consult Liggett (1985
& 1999).

The contact process and its many possible generalizations provide an interesting
class of models for problems in spatial statistics and image analysis. In Fiocco &
van Zwet (2003a & b) we began a statistical study of the supercritical contact
process $\xi_t^{\{0\}}$ that starts with a single infected site at the origin and is conditioned
on survival, i.e. on $\{\tau^{\{0\}} = \infty\}$. For this process we considered the simplest possible
statistical problem, that is, to estimate the parameters of the contact process based
on observing the set $\xi_t^{\{0\}}$ of infected sites at a single (large but unknown) time $t$. This
corresponds to the realistic situation when one observes a large forest that has
obviously been there for a long time without any knowledge when it began. On the
basis of such an observation it is clear that one can only estimate $\rho = \lambda/\mu$ but not $\lambda$
and $\mu$ individually, as without knowing $t$, one cannot distinguish between observing
the processes with parameters $c\lambda$ and $c\mu$ at time $t/c$ for different values of $c > 0$.
Equivalently, one may set $\mu = 1$ arbitrarily and estimate $\lambda$.

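For any $x, y \in \mathbb{Z}^d$ and $C \subset \mathbb{R}^d$, let $|x - y| = \sum_{1 \le i \le d} |x_i - y_i|$ denote the
$L_1$-distance of $x$ and $y$, and define

$$k_t^A(x) = \bigl(1 - \xi_t^A(x)\bigr) \sum_{y : |x - y| = 1} \xi_t^A(y), \qquad (1.8)$$

$$n_t^A(C) = \sum_{x \in C \cap \mathbb{Z}^d} \xi_t^A(x), \qquad k_t^A(C) = \sum_{x \in C \cap \mathbb{Z}^d} k_t^A(x). \qquad (1.9)$$

Notice that $n_t^A(C)$ is simply the number of infected sites in $C$ and $k_t^A(C)$ equals
the number of neighboring pairs of infected and healthy sites, with the healthy
site in $C$. For $x \in \mathbb{Z}^d$, the flip rates at time $t$ equal $\lambda k_t^A(x)$ and $\mu \xi_t^A(x)$ for the
transitions $0 \to 1$ and $1 \to 0$ respectively and hence the number $n_t^A(C)$ of infected
sites increases by 1 at time $t$ with rate $\lambda k_t^A(C)$ and decreases by 1 with rate $\mu n_t^A(C)$.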
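
The counts in (1.8) and (1.9) are easy to compute from a set of infected sites. The following is a minimal sketch for $d = 2$, assuming the configuration is stored as a Python set of integer pairs; the names `k_site` and `n_and_k` are illustrative and do not come from the paper.

```python
# Neighbors of a site in Z^2 at L1-distance 1.
NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def k_site(infected, x):
    """k_t(x): number of infected neighbors of x if x is healthy, else 0."""
    if x in infected:
        return 0
    return sum((x[0] + dx, x[1] + dy) in infected for dx, dy in NEIGHBORS)


def n_and_k(infected, region):
    """Return (n_t(C), k_t(C)) for an iterable `region` of lattice sites C."""
    region = set(region)
    n = sum(1 for x in region if x in infected)
    k = sum(k_site(infected, x) for x in region)
    return n, k


# Two adjacent infected sites; C is a box that contains all their neighbors.
infected = {(0, 0), (1, 0)}
box = [(i, j) for i in range(-2, 4) for j in range(-2, 3)]
n, k = n_and_k(infected, box)
print(n, k)  # infections then occur at rate lambda * k, recoveries at rate mu * n
```

Here each of the six healthy neighbors of the infected pair touches exactly one infected site, so the example reports two infected sites and six infected/healthy neighbor pairs.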

In Property 4 above, we explained that on $\{\tau^{\{0\}} = \infty\}$ and at a large time $t$, the
process $\xi_t^{\{0\}}$ will have progressed past the set $(1 - \varepsilon)tU$ and will be close to equilibrium
there. This implies that the rate of increase of $n_t^{\{0\}}((1 - \varepsilon)tU)$ should approximately
equal its rate of decrease, so that $\lambda k_t^{\{0\}}((1 - \varepsilon)tU) \approx \mu n_t^{\{0\}}((1 - \varepsilon)tU)$. Hence on
$\{\tau^{\{0\}} = \infty\}$, $n_t^{\{0\}}((1 - \varepsilon)tU)/k_t^{\{0\}}((1 - \varepsilon)tU)$ should be a plausible estimator of
$\rho = \lambda/\mu$, or of $\lambda$ if one assumes $\mu = 1$, were it not for the fact that $U$ is unknown.
However, one can show that for every $\varepsilon > 0$, the convex hull $\mathcal{C}(\xi_t^{\{0\}})$ of the set of
infected sites satisfies

$$(1 - \varepsilon)tU \subset \mathcal{C}(\xi_t^{\{0\}}) \subset (1 + \varepsilon)tU, \qquad (1.10)$$

eventually almost surely on $\{\tau^{\{0\}} = \infty\}$, so that $\mathcal{C}(\xi_t^{\{0\}})$ apparently approximates
$tU$. If, for any $\delta > 0$, we define

$$C_t = (1 - \delta)\,\mathcal{C}(\xi_t^{\{0\}}), \qquad (1.11)$$

then for some $\varepsilon > 0$, (1.10) ensures that $C_t \subset (1 - \varepsilon)tU$ eventually a.s. on $\{\tau^{\{0\}} = \infty\}$. Hence

$$\tilde\rho_t = \frac{n_t^{\{0\}}(C_t)}{k_t^{\{0\}}(C_t)} \qquad (1.12)$$

would seem to be a sensible estimator of $\rho$, given that the process will survive
forever. Indeed we prove in Fiocco & van Zwet (2003b) that conditional on $\{\tau^{\{0\}} = \infty\}$,
$\tilde\rho_t$ is a strongly consistent and asymptotically normal estimator of $\rho$, that is,
as $t \to \infty$,

$$\tilde\rho_t \to \rho \ \text{a.s.}, \qquad |C_t|^{1/2}(\tilde\rho_t - \rho) \to_d N(0, \tau^2). \qquad (1.13)$$

Here $|C_t|$ denotes the cardinality of $C_t \cap \mathbb{Z}^d$, or alternatively, the Lebesgue measure
of $C_t$, and an explicit expression for $\tau^2$ is available. For our purposes we merely note
that this implies that $(\tilde\rho_t - \rho) = O_P(t^{-d/2})$ on $\{\tau^{\{0\}} = \infty\}$. Simulation confirms
that the estimator behaves as predicted by the asymptotics (Fiocco (1997)).
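
As an illustration of (1.11) and (1.12), the sketch below computes the ratio of infected sites to infected/healthy neighbor pairs inside a convex hull shrunk towards the origin, for $d = 2$. It assumes the configuration is a set of integer pairs and that the origin is known. The hull routine is the standard monotone-chain algorithm, and the function names are hypothetical rather than taken from Fiocco & van Zwet.

```python
import math

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def cross(o, a, b):
    """Cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone-chain convex hull; vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_convex_polygon(poly, p):
    """True if p lies inside or on a counter-clockwise convex polygon."""
    m = len(poly)
    return all(cross(poly[i], poly[(i + 1) % m], p) >= 0 for i in range(m))


def adhoc_estimate(infected, delta):
    """n(C_t) / k(C_t) with the mask C_t = (1 - delta) * hull(infected)."""
    hull = convex_hull(infected)
    if len(hull) < 3:
        raise ValueError("the convex hull has empty interior")
    mask = [((1 - delta) * x, (1 - delta) * y) for x, y in hull]
    xs = [v[0] for v in mask]
    ys = [v[1] for v in mask]
    n = k = 0
    for i in range(math.ceil(min(xs)), math.floor(max(xs)) + 1):
        for j in range(math.ceil(min(ys)), math.floor(max(ys)) + 1):
            if not in_convex_polygon(mask, (i, j)):
                continue
            if (i, j) in infected:
                n += 1
            else:
                # healthy site in the mask: count its infected neighbors
                k += sum((i + dx, j + dy) in infected for dx, dy in NEIGHBORS)
    if k == 0:
        raise ValueError("no healthy site with an infected neighbor in the mask")
    return n / k


# Toy configuration: a 7 x 7 block of infected sites with a healthy origin.
toy = {(i, j) for i in range(-3, 4) for j in range(-3, 4)} - {(0, 0)}
print(adhoc_estimate(toy, 0.5))
```

With `delta = 0.5` the mask is the square $[-1.5, 1.5]^2$, which contains eight infected sites and one healthy site with four infected neighbors, so the toy estimate is 8/4. The toy configuration only exercises the arithmetic; it is not a realization of the contact process.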

For the estimator $\tilde\rho_t$ to perform well asymptotically as well as in simulations,
it is essential that $\delta$ should indeed be positive in (1.11). At time $t$ the process
has spread approximately to $tU$, but beyond $(1 - \varepsilon)tU$ it is not yet in equilibrium
and our argument fails. This is also intuitively obvious: having just reached the
boundary of $tU$, the infected sites beyond $(1 - \varepsilon)tU$ should be less dense than they
are closer to the origin where the infection arrived earlier and had time to achieve
equilibrium. Beyond $(1 - \varepsilon)tU$ the fraction of infected sites should be too small, but
among the infected sites the fraction with healthy neighbors should be too large.
As a result $\tilde\rho_t$ should systematically underestimate $\rho$ if $\delta$ is taken to be zero and
simulation not only confirms this, but shows that in this case the estimator is bad.
This effect also shows up asymptotically as $t \to \infty$. If $\delta = 0$, we can still prove
consistency but no longer asymptotic normality. Shrinking the convex hull $\mathcal{C}(\xi_t^{\{0\}})$
to obtain the mask $C_t$ for the estimator is essential for obtaining a satisfactory
estimator.

Two minor problems are left. First, shrinking $\mathcal{C}(\xi_t^{\{0\}})$ towards the origin to obtain
$C_t$ is possible only if one knows where the origin is, i.e. where the infection
has started. Generally this is not known: one sees the forest today, but not when or
where it began. Of course one can estimate the origin in many different ways, for
instance by averaging the locations of the infected sites. Shrinking towards this
estimated origin will not influence the asymptotic behavior of the estimator. A more
elegant solution is to replace the shrinking of $\mathcal{C}(\xi_t^{\{0\}})$ by another operation that
removes the sites near the boundary of this set. Such an operation is called peeling,
where one removes layer after layer of sites on the boundary of the convex hull.
In general, almost any reasonable type of shrinking will leave the asymptotic
behavior of the estimator unchanged as long as the same fraction of sites is removed.
Simulation suggests that this fraction should be around 20-30%, decreasing with
increasing $t$.

Second, our analysis refers only to the behavior of the process - and hence of
the estimator - on the set where $\tau^{\{0\}} = \infty$. Obviously, if $\tau^{\{0\}} < \infty$, there is not
much to observe for sufficiently large $t$, since the infection will have died out. On
the other hand, we cannot know with certainty at any finite time $t$ that we are
indeed in the case where $\tau^{\{0\}} = \infty$, so one may wonder whether asymptotic results
for $t \to \infty$ that are valid only on the set $\tau^{\{0\}} = \infty$ have any statistical significance.
However, (1.5) ensures that having survived until a large time $t$, the infection will
survive forever with overwhelming probability. Asymptotic results conditional on
$\tau^{\{0\}} = \infty$ are therefore the same as those conditioned on $\tau^{\{0\}} > t$, that is, on the
infection being present when observed.

2. Maximum likelihood for the fully observed process 

Having briefly described the statistical results obtained for the contact process
observed at a single time $t$, we now turn to the case where this process is
observed continuously on the interval $[0, t]$ for a known (large) $t > 0$. In this case
it should be possible to estimate $\lambda$ and $\mu$ separately, rather than just their
ratio $\rho = \lambda/\mu$. In fact we shall derive the maximum likelihood estimators of these
parameters.

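Let $0 < T_1 < \ldots < T_N$ denote the times when the contact process undergoes
a change in the time interval $[0, t]$ and, for $i = 1, 2, \ldots, N$, let $x_i$ denote the site
at which the change occurs at time $T_i$. It will be convenient to write $T_0 = 0$ and
$T_{N+1} = t$ and $\xi_{T_i}^{\{0\}}$ for the configuration of the process at time $T_i$. Given the
configuration at time $T_{i-1}$, the rate of change at site $x$ equals

$$r_i(x) = \begin{cases} \lambda k_{T_{i-1}}^{\{0\}}(x) & \text{if } \xi_{T_{i-1}}^{\{0\}}(x) = 0, \\ \mu & \text{if } \xi_{T_{i-1}}^{\{0\}}(x) = 1, \end{cases} \qquad (2.1)$$

and the total rate of change at any site is given by

$$R_i = \sum_{x \in \mathbb{Z}^d} r_i(x) = \lambda k_{T_{i-1}}^{\{0\}}(\mathbb{Z}^d) + \mu n_{T_{i-1}}^{\{0\}}(\mathbb{Z}^d). \qquad (2.2)$$

It follows that the likelihood of the observed process on $[0, t]$ is given by

$$L(\lambda, \mu) = \prod_{1 \le i \le N} R_i \exp\{-R_i [T_i - T_{i-1}]\}\,\frac{r_i(x_i)}{R_i} \cdot \exp\{-R_{N+1}[t - T_N]\}.$$

Hence

$$\log L(\lambda, \mu) = -\sum_{1 \le i \le N+1} R_i [T_i - T_{i-1}] + U_t \log \lambda + D_t \log \mu + h\bigl(\xi_s^{\{0\}} : 0 \le s \le t\bigr),$$

where $U_t$ and $D_t$ are the number of upward and downward jumps of the process on
$[0, t]$, i.e.

$$U_t = \#\{0 \le i \le N - 1 : \xi_{T_i}^{\{0\}}(x_{i+1}) = 0\} = \#\{1 \le i \le N : \xi_{T_i}^{\{0\}}(x_i) = 1\}, \qquad (2.3)$$

$$D_t = \#\{0 \le i \le N - 1 : \xi_{T_i}^{\{0\}}(x_{i+1}) = 1\} = \#\{1 \le i \le N : \xi_{T_i}^{\{0\}}(x_i) = 0\}, \qquad (2.4)$$

and $h(\xi_s^{\{0\}} : 0 \le s \le t)$ depends on the process $\{\xi_s^{\{0\}} : 0 \le s \le t\}$, but not on the parameters $\lambda$
and $\mu$.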

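Define

$$A_t = \int_0^t k_s^{\{0\}}(\mathbb{Z}^d)\,ds, \qquad B_t = \int_0^t n_s^{\{0\}}(\mathbb{Z}^d)\,ds. \qquad (2.5)$$

As $n_s^{\{0\}}(\mathbb{Z}^d)$ and $k_s^{\{0\}}(\mathbb{Z}^d)$ are constant for $s \in [T_{i-1}, T_i)$ and $T_{N+1} = t$, (2.2)
implies that

$$\sum_{1 \le i \le N+1} R_i [T_i - T_{i-1}] = \lambda A_t + \mu B_t$$

and hence

$$\log L(\lambda, \mu) = -\lambda A_t - \mu B_t + U_t \log \lambda + D_t \log \mu + h\bigl(\xi_s^{\{0\}} : 0 \le s \le t\bigr). \qquad (2.6)$$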

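Differentiating with respect to $\lambda$ and $\mu$ we find that the maximum likelihood
estimators $\hat\lambda_t$ and $\hat\mu_t$ of $\lambda$ and $\mu$ are given by

$$\hat\lambda_t = \frac{U_t}{A_t}, \qquad \hat\mu_t = \frac{D_t}{B_t}. \qquad (2.7)$$

The maximum likelihood estimator of $\rho = \lambda/\mu$ therefore equals

$$\hat\rho_t = \frac{U_t B_t}{D_t A_t}. \qquad (2.8)$$

As in the previous section we can prove that conditional on $\{\tau^{\{0\}} = \infty\}$, these
estimators are strongly consistent and asymptotically normal, but converge to the
parameter to be estimated at the faster rate $t^{-(d+1)/2}$. Thus conditional on
$\{\tau^{\{0\}} = \infty\}$ and as $t \to \infty$,

$$\hat\lambda_t \to \lambda, \qquad \hat\mu_t \to \mu, \qquad \hat\rho_t \to \rho \quad \text{a.s.}, \qquad (2.9)$$

$$t^{(d+1)/2}(\hat\lambda_t - \lambda) \to_d N(0, \sigma_\lambda^2), \quad t^{(d+1)/2}(\hat\mu_t - \mu) \to_d N(0, \sigma_\mu^2), \quad t^{(d+1)/2}(\hat\rho_t - \rho) \to_d N(0, \sigma_\rho^2), \qquad (2.10)$$

again with explicit expressions for the variances being available. The proof is long
and involved and will be given elsewhere.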
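
To make the estimators $U_t/A_t$, $D_t/B_t$ and $U_tB_t/(D_tA_t)$ concrete, here is a minimal simulation sketch for $d = 2$. It assumes the standard construction in which every infected site recovers at rate $\mu$ and attempts to infect each of its four neighbors at rate $\lambda$; attempts on sites that are already infected change nothing. The function name, the stopping rule (a fixed number of jumps) and the returned dictionary are illustrative choices, not part of the paper.

```python
import random

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def simulate_and_estimate(lam, mu, max_jumps, seed=0):
    """Simulate the contact process on Z^2 from a single infected site and
    return the maximum likelihood estimates U/A, D/B and their ratio."""
    rng = random.Random(seed)
    infected = [(0, 0)]      # list allows uniform sampling in O(1)
    index = {(0, 0): 0}      # site -> position in `infected`
    k_total = 4              # number of infected/healthy neighbor pairs
    A = B = 0.0              # time integrals of k_s(Z^2) and n_s(Z^2)
    U = D = 0                # upward and downward jumps
    while infected and U + D < max_jumps:
        n = len(infected)
        dt = rng.expovariate((4 * lam + mu) * n)
        A += k_total * dt
        B += n * dt
        x = infected[rng.randrange(n)]
        if rng.random() < mu / (4 * lam + mu):
            # recovery of x: x gains its infected neighbors as pairs,
            # each healthy neighbor of x loses one pair
            nb = sum((x[0] + dx, x[1] + dy) in index for dx, dy in NEIGHBORS)
            pos = index.pop(x)
            last = infected.pop()
            if last != x:
                infected[pos] = last
                index[last] = pos
            k_total += 2 * nb - 4
            D += 1
        else:
            # infection attempt on a uniformly chosen neighbor of x
            dx, dy = rng.choice(NEIGHBORS)
            y = (x[0] + dx, x[1] + dy)
            if y not in index:
                nb = sum((y[0] + ex, y[1] + ey) in index for ex, ey in NEIGHBORS)
                index[y] = len(infected)
                infected.append(y)
                k_total += 4 - 2 * nb
                U += 1
    lam_hat = U / A
    mu_hat = D / B
    rho_hat = lam_hat / mu_hat if D > 0 else float("nan")
    return {"U": U, "D": D, "A": A, "B": B, "k": k_total,
            "infected": set(infected), "lam_hat": lam_hat,
            "mu_hat": mu_hat, "rho_hat": rho_hat}


result = simulate_and_estimate(lam=2.0, mu=1.0, max_jumps=2000, seed=1)
print(result["lam_hat"], result["mu_hat"], result["rho_hat"])
```

A run may die out before `max_jumps` is reached, in which case the estimates describe a non-surviving path and the asymptotic statements above do not apply. The sketch stops at a jump time rather than at a fixed time $t$, which is a simplification.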

There are two different ways of looking at these maximum likelihood estimators
heuristically. First we may observe that the counting process $U_t$ has compensator
$\lambda A_t$ and since $A_t \to \infty$ if $\tau^{\{0\}} = \infty$, $\hat\lambda_t = U_t/A_t$ should approximate $\lambda$. Similarly,
$\mu B_t$ is the compensator of $D_t$ and $\hat\mu_t = D_t/B_t$ should approximate $\mu$ on $\{\tau^{\{0\}} = \infty\}$.
Hence $\hat\lambda_t$, $\hat\mu_t$ and $\hat\rho_t$ are plausible estimators of $\lambda$, $\mu$ and $\rho$.

However, one may also be interested in a comparison of the maximum likelihood
estimator $\hat\rho_t$ based on the fully observed process $\{\xi_s^{\{0\}} : 0 \le s \le t\}$, and the
ad-hoc estimator $\tilde\rho_t$ of Section 1, which is based on observing $\xi_t^{\{0\}}$ at the single
time $t$. We assume throughout that $\tau^{\{0\}} = \infty$. First of all, $(B_t/A_t)$ in (2.8)
estimates the same quantity $\rho$ as $\tilde\rho_t = n_t^{\{0\}}(C_t)/k_t^{\{0\}}(C_t)$ in (1.12). On the one
hand, $B_t/A_t$ averages information over the interval $[0, t]$ and should therefore have
a variance of a smaller order than $n_t^{\{0\}}(C_t)/k_t^{\{0\}}(C_t)$. On the other hand $B_t/A_t$
uses the entire set of infected points and its healthy neighbors, and we have argued
in Section 1 that without shrinking this set, this will lead to underestimating
$\rho$. The factor $U_t/D_t$ in (2.8) now serves to correct this negative bias. In equilibrium,
the number of upward and downward jumps should approximately cancel
out, but near the boundary of the set of infected points, equilibrium has not yet
set in. In fact, the number of infected sites $U_t - D_t + 1$ grows roughly as a constant
factor times the Lebesgue measure of $tU$, that is, at the rate of $t^d$. Individually,
both $U_t$ and $D_t$ are counting processes and easily seen to be of order $t^{d+1}$.
Hence $(U_t/D_t) - 1$ is positive and decreases at the rate $t^{-1}$, so that the factor
$U_t/D_t$ in (2.8) does serve to correct the negative bias which does indeed decrease
like $t^{-1}$.

The asymptotic results (1.13) and (2.10) imply that the estimators $\tilde\rho_t$ and $\hat\rho_t$
of $\rho$ have random errors of orders $O(t^{-d/2})$ and $O(t^{-(d+1)/2})$ respectively. Hence the
maximum likelihood estimator $\hat\rho_t$ based on observing the entire process $\{\xi_s^{\{0\}} : 0 \le
s \le t\}$ is asymptotically an order of magnitude better than the ad-hoc estimator $\tilde\rho_t$
based on a single observation of $\xi_t^{\{0\}}$. In Figure 2 we show a single run of simulated
values of both estimators after 500, 1,000, 1,500, ..., 20,000 jumps of the process
for $\lambda = 0.8$, $\mu = 1$, and hence $\rho = 0.8$. For the ad-hoc estimator, the shrinking of
the convex hull of infected sites $\mathcal{C}(\xi_t^{\{0\}})$ to obtain the mask $C_t$ has been achieved by
peeling rather than multiplication by $(1 - \delta)$ as is done in (1.11). Peeling fractions
of 30%, 50% and 70% were used. It appears that the maximum likelihood estimator
is indeed superior.

Figure 2: Maximum likelihood estimator $\hat\rho_t$ and the ad-hoc estimator $\tilde\rho_t$


3. Maximum likelihood for the singly observed process

As we pointed out in Section 1, one will rarely have the opportunity to observe
the process throughout a time interval $[0, t]$. In most cases one will have to be
content with a single observation of the process at a (large but unknown) time $t$.
For the latter situation we reported on the study of an ad-hoc estimator
$\tilde\rho_t = n_t^{\{0\}}(C_t)/k_t^{\{0\}}(C_t)$ of $\rho$, and noted that it is essential to choose the mask $C_t$ well
inside the convex hull $\mathcal{C}(\xi_t^{\{0\}})$ of the set of infected points in order to avoid
underestimating $\rho$. Of course, we are still interested in finding and studying the maximum
likelihood estimator for this case, if only to see whether or not it will improve
substantially on the ad-hoc estimator.

Obviously this is going to be a difficult assignment. In Section 2 we studied the
maximum likelihood estimator for the fully observed process and discovered two
things. First of all this estimator uses the ratio of (the integrals of) $n_t^{\{0\}}(\mathbb{Z}^d)$ and
$k_t^{\{0\}}(\mathbb{Z}^d)$ and we conclude that the use of $n_t^{\{0\}}(C_t)/k_t^{\{0\}}(C_t)$ in the ad-hoc estimator was a good
idea. Second, the bias correction was achieved by the correction factor $U_t/D_t$, which
is a rather more subtle way to achieve this than by discarding a sizeable fraction
of the data, as is done for the ad-hoc estimator. It therefore seems plausible that
the maximum likelihood estimator for the singly observed process will also depend
on $n_t^{\{0\}}(\mathbb{Z}^d)/k_t^{\{0\}}(\mathbb{Z}^d)$ and that conditional expectations of the numbers of upward and
downward jumps in $[0, t]$ given $\xi_t^{\{0\}}$ will also play a part.

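Since we observe $\xi_t^{\{0\}}$ at an unknown time $t$ and have no information about
the times of any of the jumps, we may discard the time element entirely and
view the process as a sequence of configurations $\zeta_1^{\{0\}}, \zeta_2^{\{0\}}, \ldots, \zeta_{n-1+2k}^{\{0\}}$ after the
first, second, ..., $(n - 1 + 2k)$th jumps that take place consecutively at sites $x_1, x_2,
\ldots, x_{n-1+2k}$ during the time interval $[0, t]$. The final configuration $\zeta_{n-1+2k}^{\{0\}}$ equals
the observed configuration $\xi_t^{\{0\}}$. For some $k$, $(n - 1 + k)$ of the jumps are upward
(i.e. $n^{\{0\}}(\mathbb{Z}^d)$ increases by 1 at this jump) and $k$ are downward. Hence the total increase
$n_t^{\{0\}}(\mathbb{Z}^d) - n_0^{\{0\}}(\mathbb{Z}^d) = n_t^{\{0\}}(\mathbb{Z}^d) - 1$ of the number of infected points must equal
$(n - 1 + k) - k = n - 1$, so that we must have $n = n_t^{\{0\}}(\mathbb{Z}^d)$. Finally, we write
$n_{i-1}^{\{0\}}(x_i)$ and $k_{i-1}^{\{0\}}(x_i)$ for the values of $n^{\{0\}}$ and $k^{\{0\}}$ after the time of the $(i - 1)$st
jump at the site $x_i$ where the next jump will occur, and $n_i^{\{0\}}(\mathbb{Z}^d)$ and $k_i^{\{0\}}(\mathbb{Z}^d)$
for the values of $n^{\{0\}}(\mathbb{Z}^d)$ and $k^{\{0\}}(\mathbb{Z}^d)$ immediately after the $i$th jump. The
probability of $(n - 1 + k)$ upward and $k$ downward jumps consecutively at sites
$x_1, x_2, \ldots, x_{n-1+2k}$ equals

$$\prod_{1 \le i \le n-1+2k} \frac{\lambda k_{i-1}^{\{0\}}(x_i) + \mu n_{i-1}^{\{0\}}(x_i)}{\lambda k_{i-1}^{\{0\}}(\mathbb{Z}^d) + \mu n_{i-1}^{\{0\}}(\mathbb{Z}^d)}
= \lambda^{n-1+k} \mu^k \prod_{1 \le i \le n-1+2k} \frac{k_{i-1}^{\{0\}}(x_i) + n_{i-1}^{\{0\}}(x_i)}{\lambda k_{i-1}^{\{0\}}(\mathbb{Z}^d) + \mu n_{i-1}^{\{0\}}(\mathbb{Z}^d)},$$

because either $k_{i-1}^{\{0\}}(x_i)$ or $n_{i-1}^{\{0\}}(x_i)$ vanishes. It follows that the likelihood is given
by

$$L'(\lambda, \mu) = \sum_{0 \le k < \infty} \lambda^{n-1+k} \mu^k \sum\nolimits^{*} \prod_{1 \le i \le n-1+2k} \frac{k_{i-1}^{\{0\}}(x_i) + n_{i-1}^{\{0\}}(x_i)}{\lambda k_{i-1}^{\{0\}}(\mathbb{Z}^d) + \mu n_{i-1}^{\{0\}}(\mathbb{Z}^d)}, \qquad (3.1)$$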

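where $\sum^{*}$ denotes summation over all possible sequences $\zeta_1^{\{0\}}, \zeta_2^{\{0\}}, \ldots, \zeta_{n-1+2k}^{\{0\}}$ for
which $\zeta_{n-1+2k}^{\{0\}}$ is the first configuration equaling $\xi_t^{\{0\}}$ and $n = n_t^{\{0\}}(\mathbb{Z}^d)$. As we
noted in Section 1 we can only estimate $\rho = \lambda/\mu$, but not $\lambda$ and $\mu$ separately as $t$ is
unknown. However, we can still maximize the likelihood $L'$ as a function of $\lambda$ and $\mu$,
but we shall find that both likelihood equations are identical. If $U$ and $D$ denote
the number of upward and downward jumps until the configuration equals $\xi_t^{\{0\}}$ for
the first time, then differentiation with respect to $\lambda$ and $\mu$ yields the likelihood
equations

$$E\bigl(U \mid \xi_t^{\{0\}}\bigr) = E\Biggl(\sum_{1 \le i \le U+D} \frac{\lambda k_{i-1}^{\{0\}}(\mathbb{Z}^d)}{\lambda k_{i-1}^{\{0\}}(\mathbb{Z}^d) + \mu n_{i-1}^{\{0\}}(\mathbb{Z}^d)} \Biggm| \xi_t^{\{0\}}\Biggr), \qquad (3.2)$$

$$E\bigl(D \mid \xi_t^{\{0\}}\bigr) = E\Biggl(\sum_{1 \le i \le U+D} \frac{\mu n_{i-1}^{\{0\}}(\mathbb{Z}^d)}{\lambda k_{i-1}^{\{0\}}(\mathbb{Z}^d) + \mu n_{i-1}^{\{0\}}(\mathbb{Z}^d)} \Biggm| \xi_t^{\{0\}}\Biggr). \qquad (3.3)$$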


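Adding these two equations yields the identity $E(U + D \mid \xi_t^{\{0\}}) = E(U + D \mid \xi_t^{\{0\}})$, so
(3.2) and (3.3) are equivalent to the difference

$$E\bigl(U - D \mid \xi_t^{\{0\}}\bigr) = E\Biggl(\sum_{1 \le i \le U+D} \frac{\lambda k_{i-1}^{\{0\}}(\mathbb{Z}^d) - \mu n_{i-1}^{\{0\}}(\mathbb{Z}^d)}{\lambda k_{i-1}^{\{0\}}(\mathbb{Z}^d) + \mu n_{i-1}^{\{0\}}(\mathbb{Z}^d)} \Biggm| \xi_t^{\{0\}}\Biggr),$$

and since $U - D = n_t^{\{0\}}(\mathbb{Z}^d) - 1$, this reduces to

$$\frac{1}{n_t^{\{0\}}(\mathbb{Z}^d) - 1}\, E\Biggl(\sum_{1 \le i \le U+D} \frac{\lambda k_{i-1}^{\{0\}}(\mathbb{Z}^d) - \mu n_{i-1}^{\{0\}}(\mathbb{Z}^d)}{\lambda k_{i-1}^{\{0\}}(\mathbb{Z}^d) + \mu n_{i-1}^{\{0\}}(\mathbb{Z}^d)} \Biggm| \xi_t^{\{0\}}\Biggr) = 1. \qquad (3.4)$$

Even though this last step removes the dependence on the conditional expectation
of $U - D$, this is no great help since the conditional behavior of $U + D$ still enters
through the range of the summation.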

Thus, as expected the maximum likelihood estimator of $\rho = \lambda/\mu$ presumably
depends on both $n_t^{\{0\}}(\mathbb{Z}^d)/k_t^{\{0\}}(\mathbb{Z}^d)$ and the conditional behavior of $U$ given $\xi_t^{\{0\}}$.

Obviously there are two different possibilities to study the maximum likelihood
estimator, namely asymptotic approximation of the estimator and simulation. Work
in the former direction is in progress.

On the “Poisson boundaries” of the
family of weighted Kolmogorov statistics

Leah Jager and Jon A. Wellner

University of Washington 

Abstract: Berk and Jones (1979) introduced a goodness of fit test statistic
$R_n$ which is the supremum of pointwise likelihood ratio tests for testing
$H_0 : F(x) = F_0(x)$ versus $H_1 : F(x) \neq F_0(x)$. They showed that their statistic
does not always converge almost surely to a constant under alternatives
$F$, and, in fact that there exists an alternative distribution function $F$ such
that $R_n \to_d \sup_{t>0} N(t)/t$ where $N$ is a standard Poisson process on $[0, \infty)$. We
call the particular distribution function $F$ which leads to this limiting Poisson
behavior the Poisson boundary distribution function for $R_n$. We investigate
Poisson boundaries for weighted Kolmogorov statistics $D_n(\psi)$ for various
weight functions $\psi$ and comment briefly on the history of results concerning
Bahadur efficiency of these statistics. One result of note is that the logarithmically
weighted Kolmogorov statistic of Groeneboom and Shorack (1981) has
the same Poisson boundary as the statistic of Berk and Jones (1979).


1. Introduction 

Suppose that $X_1, \ldots, X_n$ are i.i.d. $F$ on $\mathbb{R}$ and we want to test the null hypothesis

$$H : F(x) = F_0(x) \quad \text{for all } x \in \mathbb{R},$$

where $F_0$ is continuous, versus the alternative hypothesis

$$K : F(x) \neq F_0(x) \quad \text{for some } x \in \mathbb{R}.$$

As usual, we can reduce to the case when $F_0$ is the Uniform(0, 1) distribution on
$[0, 1]$; i.e. $F_0(x) = x$ for $0 \le x \le 1$.

Berk and Jones (1979) introduced the test statistic $R_n$, which is defined as

$$R_n = \sup_{-\infty < x < \infty} K(F_n(x), F_0(x)), \qquad (1.1)$$

where

$$K(x, y) = x \log\frac{x}{y} + (1 - x) \log\frac{1 - x}{1 - y} \qquad (1.2)$$

and $F_n$ is the empirical distribution function of the $X_i$'s, given by

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1_{[X_i \le x]}. \qquad (1.3)$$

Define

$$K^+(x, y) = \begin{cases} K(x, y), & 0 < y < x < 1, \\ 0, & 0 \le x \le y \le 1, \\ \infty, & \text{otherwise,} \end{cases}$$

and

$$K^-(x, y) = \begin{cases} K(x, y), & 0 < x < y < 1, \\ 0, & 0 \le y \le x \le 1, \\ \infty, & \text{otherwise.} \end{cases} \qquad (1.4)$$

Berk and Jones also studied the one-sided statistics $R_n^+$ and $R_n^-$ defined by

$$R_n^+ = \sup_x K^+(F_n(x), x), \qquad R_n^- = \sup_x K^-(F_n(x), x).$$

Berk and Jones (1979) discussed the optimality properties of the statistics $R_n^+$
and $R_n$. They showed, in particular, that they have greater Bahadur efficiency than
the corresponding Kolmogorov statistics. Berk and Jones (1979) also extended this
comparison to weighted Kolmogorov statistics via the results of Abrahamson (1967).
In view of the results of Groeneboom and Shorack (1981), these comparisons are
trivial for any weight function $\psi$ of the form $\psi(x) = [x(1 - x)]^{-b}$ for any positive $b$
since Groeneboom and Shorack show that the limiting efficacy of the weighted
Kolmogorov statistics with power function weighting is in fact zero for any alternative
for which the efficacy makes sense. Moreover, as we show here the efficacies of the
weighted Kolmogorov statistics are not well-defined (and the Bahadur efficiency
comparison is not meaningful) for fixed alternatives at or beyond certain “Poisson
boundaries” which we describe below. Thus it seems to us that the assertion by
Owen (1995), at the end of his section 1, that the statistics of Berk and Jones (1979)
have “increased efficiency over any weighted Kolmogorov Smirnov method at any
alternative distribution” is an over-interpretation of the results of Berk and Jones
(1979).

Wellner and Koltchinskii (2003) present a proof of the limiting null distribution
of the Berk-Jones statistic, and Owen (1995) computes exact quantiles under the
null distribution for finite $n$; see also Owen (2001). Using these quantiles, Owen
constructs confidence bands for $F$ by inverting the Berk and Jones test, and
then calculates the power associated with the Berk-Jones test statistic for fixed
alternatives of the form F{x) = Fo(x)“. See Jager and W'ellner (2004) for some 
corrections of the results of Owen (1995). 

One of the interesting results for the statistic $R_n$ proved in Berk and Jones
(1979) is the following limit behavior under a rather extreme alternative distribution.

Theorem 1 (Berk and Jones (1979)). Suppose that $X_1, \ldots, X_n$ are i.i.d. with
distribution function $F$ given by

$$F(x) = \frac{1}{1 + \log(1/x)}, \qquad 0 < x \le 1. \qquad (1.5)$$

Then

$$R_n^+ \to_d \sup_{0 < t < \infty} \frac{N(t)}{t} \stackrel{d}{=} \frac{1}{U}, \qquad R_n \to_d \sup_{0 < t < \infty} \frac{N(t)}{t} \stackrel{d}{=} \frac{1}{U},$$

where $N$ is a standard Poisson process on $[0, \infty)$ and $U$ is a Uniform(0, 1) random
variable.

Because of the Poisson nature of the limiting distribution in Theorem 1, we call
the corresponding alternative distribution function $F$ a “Poisson boundary” for the
test statistic $R_n$. The fact that $\sup_{t>0} N(t)/t \stackrel{d}{=} 1/U$ follows from results of Pyke
(1959), page 571, and elementary manipulations, or, alternatively from the classical
result of Daniels (1945) that

$$P\Bigl(\sup_{0 < t \le 1} G_n(t)/t \ge x\Bigr) = 1/x \quad \text{for } x \ge 1,$$

where $G_n$ is the empirical distribution function of $n$ i.i.d. Uniform(0, 1) random
variables (see e.g. Shorack and Wellner (1986), page 404) together with the Poisson
convergence results of Wellner (1977b).
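
Daniels' identity holds for every finite $n$, which makes it easy to check by simulation. The sketch below is a Monte Carlo illustration only: it uses the fact that $G_n(t)/t$ decreases between jumps, so the supremum over $0 < t \le 1$ equals $\max_i (i/n)/U_{(i)}$ over the order statistics $U_{(i)}$. The function names and the choices of $n$, $x$ and the number of replications are arbitrary.

```python
import random


def sup_ratio(n, rng):
    """sup over 0 < t <= 1 of G_n(t)/t for n i.i.d. Uniform(0,1) draws."""
    # 1 - random() lies in (0, 1], which avoids a division by zero
    u = sorted(1.0 - rng.random() for _ in range(n))
    return max((i / n) / u_i for i, u_i in enumerate(u, start=1))


def tail_estimate(n, x, reps, seed=0):
    """Monte Carlo estimate of P(sup G_n(t)/t >= x)."""
    rng = random.Random(seed)
    hits = sum(sup_ratio(n, rng) >= x for _ in range(reps))
    return hits / reps


for x in (2.0, 4.0):
    print(x, tail_estimate(n=20, x=x, reps=20000, seed=1), 1 / x)
```

The printed estimates should be close to $1/x$, up to Monte Carlo error of order $\text{reps}^{-1/2}$.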

For alternatives $F$ that are “less extreme” than the $F$ given in Theorem 1,
Berk and Jones (1979) give sufficient conditions under which the following more usual
or “expected” behavior holds:

$$R_n^+ \to_{a.s.} \sup_x K^+(F(x), x), \qquad \text{and} \qquad R_n \to_{a.s.} \sup_x K(F(x), x).$$

Some questions related to this type of result are discussed further in Section 4.

Our main purpose here is to note that the phenomenon of a Poisson boundary
is not unique to the Berk-Jones statistic $R_n$, but that in fact this type of behavior
holds for a general class of “weighted” type statistics. Indeed we will show that the
Poisson boundary for the weighted Kolmogorov statistics is a much less extreme
alternative than the Poisson boundary distribution function $F$ (given in (1.5)) found
by Berk and Jones (1979) for their statistic.


2. “Poisson boundaries” for weighted Kolmogorov statistics

Consider the family of weighted Kolmogorov-Smirnov statistics given by

$$D_n(b) = \sup_{0 < x < 1} \frac{|F_n(x) - x|}{(x(1 - x))^b}, \qquad (2.6)$$

where $F_n$ is the empirical distribution function of the $X_i$'s and $0 < b \le 1$. The
asymptotic behavior of $D_n(b)$ under the null hypothesis $H$ is well-known: for
$0 < b < 1/2$,

$$n^{1/2} D_n(b) \to_d \sup_{0 < t < 1} \frac{|\mathbb{U}(t)|}{(t(1 - t))^b},$$

where $\mathbb{U}$ is a standard Brownian bridge process, while for $1/2 < b \le 1$,

$$n^{1-b} D_n(b) \to_d \max\Bigl\{ \sup_{0 < t < \infty} \frac{|N(t) - t|}{t^b},\ \sup_{0 < t < \infty} \frac{|\tilde N(t) - t|}{t^b} \Bigr\},$$

where $N$ and $\tilde N$ are independent standard Poisson processes. The case $0 < b < 1/2$
follows from Chibisov (1964) and O'Reilly (1974); see e.g. Shorack and Wellner
(1986), pages 461-466, or Csörgő and Horváth (1993), Theorem 3.2, page 217. The
case $1/2 < b \le 1$ follows from Mason (1983); see also Csörgő and Horváth (1993),
Theorem 1.2, page 265. When $b = 1/2$ the limit behavior is due to Jaeschke (1979)
and Eicker (1979), which in turn rely on the classical results of Darling and Erdős
(1956):

$$b_n n^{1/2} D_n(1/2) - c_n \to_d E_v^4,$$

where $b_n = (2 \log\log n)^{1/2}$, $c_n = 2 \log\log n + (1/2) \log\log\log n - (1/2) \log(4\pi)$, and
$P(E_v^4 \le x) = \exp(-4e^{-x})$; see e.g. Shorack and Wellner (1986), page 600.
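
For a concrete sample the weighted statistic $D_n(b)$ of (2.6) can be evaluated from the order statistics. The sketch below assumes $0 < b \le 1$ and a sample strictly inside $(0, 1)$. It also uses a short calculus argument that is not in the text: on each interval where $F_n$ equals a constant $c$, the ratio $|c - x|/(x(1-x))^b$ decreases for $x < c$ and increases for $x > c$, so the supremum is approached at the interval endpoints. The function name is illustrative.

```python
def weighted_ks(sample, b):
    """D_n(b) = sup |F_n(x) - x| / (x(1-x))^b over 0 < x < 1, for 0 < b <= 1."""
    xs = sorted(sample)
    n = len(xs)
    best = 0.0
    for i, x in enumerate(xs, start=1):
        weight = (x * (1 - x)) ** b
        # deviation at x itself (F_n = i/n) and just to the left (F_n = (i-1)/n)
        deviation = max(abs(i / n - x), abs((i - 1) / n - x))
        best = max(best, deviation / weight)
    return best


print(weighted_ks([0.25, 0.75], 1.0))
print(weighted_ks([0.5], 0.5))
```

In the first example every deviation equals 0.25 and the weight equals 0.1875, so the statistic is 4/3; in the second it is 0.5/0.5 = 1.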

Our goal here is to prove the following theorems concerning particular fixed
alternative hypotheses.

Theorem 2. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. $F$ where $F(x) = x^b$ for $0 \le
x \le 1$. Then

$$D_n(b) \to_d \sup_{0 < t < \infty} \frac{N(t)}{t} \stackrel{d}{=} \frac{1}{U}, \qquad (2.7)$$

where $U \sim$ Uniform(0, 1).

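Theorem 2 does not cover the interesting special case $b = 1$. For $b = 1$ we have
the following (more special) result.

Theorem 2A. Suppose that $c > 1$ and that $X_1, X_2, \ldots, X_n$ are i.i.d. $F$ where

$$F(x) = \begin{cases} 0, & -\infty < x < 0, \\ cx, & 0 \le x \le 1/c, \\ 1, & 1/c < x < \infty. \end{cases}$$

Then

$$D_n(1) \to_d \Bigl(c \sup_{0 < t < \infty} \frac{N(t)}{t} - 1\Bigr) \vee c \stackrel{d}{=} \Bigl(\frac{c}{U} - 1\Bigr) \vee c \equiv Y_c,$$

where $U \sim$ Uniform(0, 1) and

$$P(Y_c \le x) = \begin{cases} 0, & x < c, \\ 1 - c/(x + 1), & x \ge c. \end{cases} \qquad (2.8)$$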
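
The limit law in (2.8) can be checked numerically by simulating $\sup_{t>0} N(t)/t$ directly. The sketch below is illustrative only. It uses the fact that $N(t)/t$ decreases between arrivals, so the supremum equals $\max_k k/S_k$ over the arrival times $S_k$, and it truncates the maximum after a fixed number of arrivals, which is an approximation (late arrivals matter with negligible probability because $k/S_k \to 1$). All names and tuning constants are assumptions.

```python
import random
from itertools import accumulate


def sup_poisson_ratio(rng, max_arrivals=200):
    """Approximate sup_{t>0} N(t)/t by max_k k/S_k over the first arrivals."""
    arrivals = accumulate(rng.expovariate(1.0) for _ in range(max_arrivals))
    return max(k / s for k, s in enumerate(arrivals, start=1))


def sample_yc(c, reps, seed=0):
    """Draws of Y_c = (c * sup N(t)/t - 1) v c."""
    rng = random.Random(seed)
    return [max(c * sup_poisson_ratio(rng) - 1.0, c) for _ in range(reps)]


def yc_cdf(c, x):
    """Distribution function of Y_c as given in (2.8)."""
    return 0.0 if x < c else 1.0 - c / (x + 1.0)


c = 2.0
draws = sample_yc(c, reps=5000, seed=1)
atom = sum(y == c for y in draws) / len(draws)
below5 = sum(y <= 5.0 for y in draws) / len(draws)
print(atom, yc_cdf(c, c))        # mass of the atom at c, 1/(1 + c)
print(below5, yc_cdf(c, 5.0))
```

The first printed pair compares the simulated mass at $c$ with $1/(1 + c)$, the height of the jump of (2.8) at $c$.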


Theorems 2 and 2A do not cover the case of (very light) logarithmic weights
which are of interest because of their connection to the results of Groeneboom
and Shorack (1981). These authors showed that with $\psi = \psi_2$ where $\psi_2(x) =
-\log(x(1 - x))$, the $\psi$-weighted Kolmogorov statistics

$$D_n(\psi) = \sup_{0 < x < 1} |F_n(x) - F_0(x)|\,\psi(x), \qquad D_n^+(\psi) = \sup_{0 < x < 1} (F_n(x) - F_0(x))\,\psi(x) \qquad (2.9)$$

have non-trivial large deviation behavior under the null hypothesis and hence have
non-trivial Bahadur slopes as long as

$$D_n(\psi) \to_{a.s.} d(\psi, F), \qquad D_n^+(\psi) \to_{a.s.} d^+(\psi, F) \qquad (2.10)$$

respectively under the alternative hypothesis $F$. Thus it is of interest to determine
under what conditions (for what $F$'s) (2.10) holds. A step in this direction is to find
the Poisson boundary for $D_n(\psi_2)$. As it turns out, $D_n(\psi_2)$ has the same Poisson
boundary distribution function as the Berk-Jones statistic $R_n$.

Theorem 2B. Let $F$ be the distribution function given by (1.5). If $X_1, \ldots, X_n$ are
i.i.d. $F$, then

$$D_n^+(\psi_2) \to_d \sup_{0 < t < \infty} \frac{N(t)}{t} \stackrel{d}{=} \frac{1}{U}, \qquad D_n(\psi_2) \to_d \sup_{0 < t < \infty} \frac{N(t)}{t} \stackrel{d}{=} \frac{1}{U},$$

where $N$ is a standard Poisson process and $U \sim$ Uniform(0, 1).

An alternative test statistic, $\tilde R_n$, which we have called the reversed Berk-Jones
statistic in Jager and Wellner (2004), is defined by

$$\tilde R_n = \sup_{X_{(1)} \le x < X_{(n)}} K(F_0(x), F_n(x)), \qquad (2.11)$$

where $X_{(1)}$ and $X_{(n)}$ are the first and last order statistics, respectively.

The motivation behind this statistic comes from examination of the functions
$K(F_0(x), F(x))$ and $K(F(x), F_0(x))$, for an alternative distribution function $F$.
When $F$ is stochastically smaller than $F_0$, we expect the Berk-Jones test to be
more powerful than the reversed Berk-Jones statistic, since $\sup_x K(F(x), F_0(x)) >
\sup_x K(F_0(x), F(x))$ in this case. However, in the case where $F$ is stochastically
larger than $F_0$, we have $\sup_x K(F(x), F_0(x)) < \sup_x K(F_0(x), F(x))$, and so we
expect the reversed statistic to be more powerful.

We do not yet know if $\tilde R_n$ has a “Poisson boundary”. The question is: does there
exist an alternative distribution function $F$ such that when sampling from $F$ we
have

$$\tilde R_n \to_d g(N)$$

for some functional $g$ of a (standard?) Poisson process $N$?

Before giving the proofs we state two results that will be used repeatedly in
the proofs: the weighted Glivenko-Cantelli theorem of Lai (1974) (see also Wellner
(1977a) and Shorack and Wellner (1986), page 410), and bounds for the sup of ratios
given by Wellner (1978) and Berk and Jones (1979) (see also Shorack and Wellner
(1986), Inequality 10.3.2, pages 415 and 416). Let
$G_n(t) = n^{-1} \sum_{i=1}^{n} 1_{[0,t]}(\xi_i)$ where $\xi_1, \ldots, \xi_n, \ldots$ are i.i.d. Uniform(0, 1)
random variables, and let $I$ be the identity function on $[0, 1]$.

Theorem W-GC (Lai (1974); Wellner (1977a)). Suppose that $\psi$ is positive
on $(0, 1)$, decreasing on $(0, 1/2]$, and symmetric about $1/2$. Then

$$\limsup_{n \to \infty} \|(G_n - I)\psi\| = \begin{cases} 0 \\ \infty \end{cases} \ \text{a.s. according as} \ \int_0^1 \psi(t)\,dt \ \begin{cases} < \infty, \\ = \infty. \end{cases}$$

Theorem (Ratio bounds). (Wellner (1978), Berk and Jones (1979)). For all
$x \ge 1$ and $0 < \varepsilon \le 1$

$$P\Bigl(\sup_{\varepsilon \le t \le 1} \frac{G_n(t)}{t} \ge x\Bigr) \le \begin{cases} \exp(-n\varepsilon h(x)), \\ \exp(-nK^+(\varepsilon x, \varepsilon)), \end{cases} \qquad (2.12)$$

and

$$P\Bigl(\sup_{\varepsilon \le t \le 1} \frac{t}{G_n(t)} \ge x\Bigr) \le \begin{cases} \exp(-n\varepsilon h(1/x)), \\ \exp(-nK^+(1 - \varepsilon/x, 1 - \varepsilon)), \end{cases} \qquad (2.13)$$

where $h(x) = x(\log x - 1) + 1$ and where $K^+$ is as defined in (1.4).

Now we provide proofs for Theorems 2, 2A, and 2B.

Proof of Theorem 2. Let $0 < a < 1$. We write

$$\begin{aligned}
D_n(b) &= \sup_{0 < x < 1} \frac{|F_n(x) - x|}{(x(1 - x))^b} \\
&= \sup_{x : F(x) \le n^{-a}} \frac{|F_n(x) - x|}{(x(1 - x))^b} \vee \sup_{x : F(x) > n^{-a}} \frac{|F_n(x) - x|}{(x(1 - x))^b} \\
&= \sup_{x : F(x) \le n^{-a}} \frac{|F_n(x) - x|}{F(x)(1 - F(x)^{1/b})^b} \vee \sup_{x : F(x) > n^{-a}} \frac{|F_n(x) - x|}{F(x)(1 - F(x)^{1/b})^b} \\
&\equiv D_n^{(1)}(b) \vee D_n^{(2)}(b).
\end{aligned}$$

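Now

$$\begin{aligned}
D_n^{(1)}(b) - \sup_{x : F(x) \le n^{-a}} \frac{F_n(x)}{F(x)}
&= \sup_{x : F(x) \le n^{-a}} \frac{|F_n(x) - x|}{F(x)(1 - F(x)^{1/b})^b} - \sup_{x : F(x) \le n^{-a}} \frac{F_n(x)}{F(x)} \\
&\le \sup_{x : F(x) \le n^{-a}} \left| \frac{F_n(x)}{F(x)(1 - F(x)^{1/b})^b} - \frac{F_n(x)}{F(x)} \right| + 2 \sup_{x : F(x) \le n^{-a}} \frac{x}{x^b (1 - x)^b} \\
&\le \sup_{x : F(x) \le n^{-a}} \frac{F_n(x)}{F(x)} \cdot \sup_{x : F(x) \le n^{-a}} \left| \frac{1}{(1 - x)^b} - 1 \right| + o(1) \\
&= O_p(1)\,o(1) + o(1) = o_p(1).
\end{aligned}$$

On the other hand,

$$\sup_{x : F(x) \le n^{-a}} \frac{F_n(x)}{F(x)} - D_n^{(1)}(b)
= \sup_{x : F(x) \le n^{-a}} \frac{F_n(x)}{F(x)} - \sup_{x : F(x) \le n^{-a}} \frac{|F_n(x) - x|}{F(x)(1 - F(x)^{1/b})^b}
\le \sup_{x : F(x) \le n^{-a}} \frac{x}{x^b (1 - x)^b} = o(1)$$

since

$$\begin{aligned}
\sup_{x : F(x) \le n^{-a}} \frac{|F_n(x) - x|}{F(x)(1 - F(x)^{1/b})^b}
&\ge \sup_{x : F(x) \le n^{-a}} \frac{F_n(x) - x}{F(x)(1 - F(x)^{1/b})^b} \\
&\ge \sup_{x : F(x) \le n^{-a}} \frac{F_n(x)}{F(x)(1 - F(x)^{1/b})^b} - \sup_{x : F(x) \le n^{-a}} \frac{x}{F(x)(1 - F(x)^{1/b})^b} \\
&\ge \sup_{x : F(x) \le n^{-a}} \frac{F_n(x)}{F(x)} - o(1).
\end{aligned}$$

Concerning $D_n^{(2)}(b)$ we have

$$\begin{aligned}
D_n^{(2)}(b) &= \sup_{x : F(x) > n^{-a}} \frac{|F_n(x) - x|}{F(x)(1 - F(x)^{1/b})^b} \\
&\le \sup_{x : F(x) > n^{-a}} \frac{|F_n(x) - F(x)|}{F(x)(1 - F(x)^{1/b})^b} + \sup_{x : F(x) > n^{-a}} \frac{|F(x) - x|}{F(x)(1 - F(x)^{1/b})^b} \\
&\le \sup_{x : n^{-a} < F(x) \le 1/2} \frac{|F_n(x) - F(x)|}{F(x)(1 - F(x)^{1/b})^b} + \sup_{x : 1/2 < F(x) < 1} \frac{|F_n(x) - F(x)|}{F(x)(1 - F(x)^{1/b})^b} + 1 \\
&\le \frac{1}{(1 - 2^{-1/b})^b} \sup_{x : n^{-a} < F(x) \le 1/2} \frac{|F_n(x) - F(x)|}{F(x)} + 2 \sup_{x : 1/2 < F(x) < 1} \frac{|F_n(x) - F(x)|}{(1 - F(x)^{1/b})^b} + 1 \\
&= o(1) + o(1) + 1
\end{aligned}$$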


almost surely by Lemma 4.3 of Berk and Jones (1979) for the first term, and by
the weighted Glivenko-Cantelli Theorem W-GC for the second term since

$$\int_0^1 \frac{du}{(1 - u^{1/b})^b} = b \int_0^1 (1 - v)^{-b} v^{b-1}\,dv = b\,\Gamma(1 - b)\Gamma(b) < \infty$$

for $b \in (0, 1)$. Hence it follows that $\limsup_{n \to \infty} D_n^{(2)}(b) \le 1$ almost surely. Putting
all this together with the fact that

$$\sup_{x : F(x) \le n^{-a}} \frac{F_n(x)}{F(x)} \to_d \sup_{0 < t < \infty} \frac{N(t)}{t} \stackrel{d}{=} 1/U$$

finishes the proof of Theorem 2. □


Proof of Theorem 2A. Since $F_n = G_n(F)$, where $G_n$ is the empirical distribution
function of i.i.d. Uniform(0, 1) random variables $\xi_1, \ldots, \xi_n$, we can write

\[
\begin{aligned}
D_n(1) &= \sup_{0<x<1} \frac{|G_n(F(x))-x|}{x(1-x)} \\
&= \sup_{0<x<1/c} \frac{|G_n(cx)-x|}{x(1-x)} \vee \sup_{1/c\le x<1} \frac{|1-x|}{x(1-x)} \\
&= \sup_{0<t<n} \frac{|nG_n(t/n)-t/c|}{(t/c)(1-t/(cn))} \vee c \\
&\to_d \sup_{0<t<\infty} \frac{|N(t)-t/c|}{t/c} \vee c \\
&= \Bigl(c \sup_{0<t<\infty} \frac{N(t)}{t} - 1\Bigr) \vee 1 \vee c \stackrel{d}{=} \Bigl(\frac{c}{U} - 1\Bigr) \vee c \equiv Y_c
\end{aligned}
\]

since $c > 1$ and since the process $\{nG_n(t/n) : 0 \le t \le n\}$ converges weakly to
the standard Poisson process $N$ in a topology that makes the weighted supremum
functional in the last display continuous; see e.g. Wellner (1977b), Theorem 7,
page 1007. Computation of the distribution of $Y_c$ is straightforward. (Note that
this distribution has a jump at $c$ of height $1/(1 + c)$.) □
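The distribution of $Y_c$ can be written out explicitly. The sketch below assumes the representation $Y_c = (c/U - 1) \vee c$ with $U$ uniform on $(0,1)$, which is our reading of the limit in the proof; the function names are hypothetical. Under that assumption $P(Y_c \le y) = 1 - c/(1+y)$ for $y \ge c$, which gives the jump of height $1/(1+c)$ at $c$.

```python
import random

def cdf_Yc(y, c):
    """P(Y_c <= y) for Y_c = max(c/U - 1, c), U ~ Uniform(0, 1), c > 1 (assumed form)."""
    if y < c:
        return 0.0
    return 1.0 - c / (1.0 + y)

def jump_at_c(c):
    """Mass that Y_c places on the single point c."""
    return cdf_Yc(c, c)

def simulate_Yc(c, n, rng):
    """Draw n values of Y_c; 1 - rng.random() lies in (0, 1], avoiding division by zero."""
    return [max(c / (1.0 - rng.random()) - 1.0, c) for _ in range(n)]

if __name__ == "__main__":
    c = 2.0
    draws = simulate_Yc(c, 100000, random.Random(0))
    print(jump_at_c(c), sum(d == c for d in draws) / len(draws))
```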

Proof of Theorem 2B. Let $0 < a < 1$. We write

\[
\begin{aligned}
D_n(\psi_2) &= \sup_{0<x<1} |F_n(x)-x|\,\psi_2(x) \\
&= \sup_{x:F(x)\le n^{-a}} |F_n(x)-x|\,\psi_2(x) \vee \sup_{x:F(x)>n^{-a}} |F_n(x)-x|\,\psi_2(x) \\
&\equiv D_n^{(1)}(\psi_2) \vee D_n^{(2)}(\psi_2).
\end{aligned}
\]

We first deal with $D_n^{(2)}(\psi_2)$. Note that


< 


sup |F„(x) -x|i/»2(x) 

x:F(x)>n~'* 

sup |F„(x) - F(x)|^2(ar) 

x:F(x)>n~“ 

+ sup |F(x) - x|V'2(3:) 

x:F(x)>n~“ 

|F„(x) - F(x)| r./ w / N 
sup F{x)lp2(x) 

x:n-«<F(x)<l/2 


F(x) 


x:l/2<F(x)<l (l-F(x))V 

|F„(x) - F(i)| 


sup 

x:n-"<F(i)<l/2 


F(x) 

|F„(x)-F(x)| 




+ 1 


= o(l) + o(l) + l 
almost surely by Lemma 4.3 of Berk and Jones (1979) or Wellner (1978) for the first
term, and Theorem W-GC for the second term. Here we also used ^(x)F(x) < 1 
for 0 < X < 1/2, and (1 — F(x))‘^/'*V’ 2 (^) <1 for 1/2 < x < 1. 

To handle $D_n^{(1)}(\psi_2)$, note that


- sup 


Fn(x) 


= sup 

x:F(x)<n~** 


n{x) - X\ 


Fix) 


F{x)\l} 2 ix) - sup 


¥n{x) 


= .sup \ ^ ^(3^)^2 (j^) V sup 

x:F(x)<n-^ x:F(x)<n~ 


:F(x)<r 

F„(x) 

~ETT 

x:F(x)<jt-» 


:F(x)<n“" 

X - Fn(x) 
F{x)<n-° P'(^) 


F{x)ip2{x) 


Fn(x) 


x:F(x)<n-“ ^ 


F„(x) 


F(x)V^(x)- sup . 

x:F(x)<n~“ ^ x:F(x)<n~‘ 


+ .sup Xtp2{x) 


F„(x) . 

x:F(x)<n-“ ^ \X) 


sup 

:F(x)<n~" ■* \^) 


Ifn(l) 


+ 0 ( 1 ) 


< sup 

x:F(x)<u~“ 

< sup 

x:F(x)<n~" 


F„(x) 


F(x) 

F„(a:) 


(F(x)t/) 2 (x) - 1) 


+ o(l) 


sup 

x:F(x)<r«~‘ 


F(x) 

$\le O_p(1)\,o(1) + o(1) = o_p(1).$


F(x)V^(x) - 1 


+ 0(1) 


On the other hand, 


sup 

x:F(x)<n~" 


Fn(x) 

Fix) 



F„(x) 

= sup - sup 

i:F(x)<n~" ^ x:F(x)<ji~‘* 

< sup x0(x) = o(l) 

x:F(x)<n~" 


|Fn(x) -X| 

Fix) 


F(x)V^(x) 


since 


|F„(x) -x| r./ N , / X 
®*‘P Fix)tp2ix) 

x:F(x)<n-“ ^ i^} 

^ F„(x) -X , 

- ^^P — F / ~ r f^i^)t2{x) 

i:F(i)<n-« ^(2:1 

Fn(a^) 

- - >/ ! F'ix)ij} 2 ix) - sup Xt/i 2 (a;) 

x:F(x)<n~“ ■* V^/ x:F(x)<rj~“ 

- ^^*P ^^^(1 “ o(l)) - o(l). 


t:F(x)<n-“ F(X) 


Combining these pieces as in the proof of Theorem 2 completes the proof for $D_n(\psi)$.
The proof for $D_n(\psi_2)$ is similar (and easier). □




3. A consistency result 

Theorems 2, 2A, 2B suggest that we might expect classical behavior for the weighted
Kolmogorov statistics under fixed alternatives $F$ sufficiently "inside" their respective
Poisson boundaries. Here are two of the expected consistency results. They are,
in fact, corollaries of the weighted Glivenko-Cantelli Theorem W-GC in Section 2, or
of general Glivenko-Cantelli theory (see e.g. Dudley (1999) or van der Vaart and Wellner
(1996)).

Theorem 3. Suppose that $X_1, X_2, \ldots$ are i.i.d. $F$ on $[0,1]$ and $0 < b < 1$.

(i) If $E[(X(1-X))^{-b}] < \infty$, then

\[
D_n(b) = \sup_{0<x<1} \frac{|F_n(x)-x|}{(x(1-x))^{b}} \to_{a.s.} \sup_{0<x<1} \frac{|F(x)-x|}{(x(1-x))^{b}} \equiv d(b,F) < \infty .
\]

(ii) If $E[(X(1-X))^{-b}] = \infty$, then $\limsup_{n\to\infty} D_n(b) = +\infty$ a.s.

Theorem 3B. Suppose that $X_1, X_2, \ldots$ are i.i.d. $F$ on $[0,1]$ and $\psi_2(x) = -\log(x(1-x))$.

(i) If $E[\psi_2(X)] < \infty$, then

\[
D_n(\psi_2) = \sup_{0<x<1} |F_n(x)-x|\,\psi_2(x) \to_{a.s.} \sup_{0<x<1} |F(x)-x|\,\psi_2(x) \equiv d(\psi_2,F) < \infty .
\]

(ii) If $E[\psi_2(X)] = \infty$, then $\limsup_{n\to\infty} D_n(\psi_2) = +\infty$ almost surely.

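Proof of Theorem 3. Note that

\[
\begin{aligned}
|D_n(b) - d(b,F)| &\le \sup_{0<x<1} \frac{|F_n(x)-F(x)|}{(x(1-x))^{b}}
 = \sup_{0<x<1} \frac{|G_n(F(x))-F(x)|}{(x(1-x))^{b}} \\
&\le \sup_{0<u<1} \frac{|G_n(u)-u|}{(F^{-1}(u)(1-F^{-1}(u)))^{b}} \to_{a.s.} 0
\end{aligned}
\]

if

\[
\int_0^1 \frac{du}{(F^{-1}(u)(1-F^{-1}(u)))^{b}} < \infty \tag{3.14}
\]

by Theorem W-GC, or by part A of Wellner (1977a) and Remark 1 on page 475. But
(3.14) holds if and only if the stated hypothesis holds, by the fact that $F^{-1}(U) \stackrel{d}{=} X \sim F$
for $U \sim U(0,1)$. □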

Remark 1. Note that for the "Poisson boundary" distribution function $F(x) = x^{b}$
for $D_n(b)$,

\[
E[(X(1-X))^{-b}] = \int_0^1 \frac{b\,x^{b-1}}{(x(1-x))^{b}}\,dx = b \int_0^1 \frac{1}{x(1-x)^{b}}\,dx = \infty,
\]

so the hypothesis of Theorem 3 part (i) (just) fails. On the other hand, if $F(x) = x^{c}$
with $b < c < 1$, then

\[
E[(X(1-X))^{-b}] = \int_0^1 \frac{c\,x^{c-1}}{(x(1-x))^{b}}\,dx = c \int_0^1 \frac{1}{x^{1+b-c}(1-x)^{b}}\,dx < \infty,
\]

so the hypothesis of Theorem 3(i) holds and $D_n(b) \to_{a.s.} d(b,F)$.

Remark 2. Note that for the "Poisson boundary" distribution function $F(x) =
(1 + \log(1/x))^{-1}$ for the statistic $D_n(\psi_2)$,

\[
E_F[\psi_2(X)] = \int_0^1 \log\Bigl(\frac{1}{x(1-x)}\Bigr)\,dF(x) = \infty,
\]

so the hypothesis of Theorem 3B part (i) (just) fails.
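The moment in Remark 1 has a closed form that makes the "just fails" behavior visible. The sketch below is an illustration under the stated assumption $F(x) = x^{c}$; it uses the Beta-function identity $c\int_0^1 x^{c-1-b}(1-x)^{-b}\,dx = c\,\Gamma(c-b)\Gamma(1-b)/\Gamma(c+1-2b)$, and the function name is hypothetical.

```python
import math

def weighted_moment(b, c):
    """E[(X(1-X))**(-b)] for X with cdf F(x) = x**c on [0, 1]; infinite when c <= b."""
    if c <= b:
        return math.inf
    return c * math.gamma(c - b) * math.gamma(1.0 - b) / math.gamma(c + 1.0 - 2.0 * b)

if __name__ == "__main__":
    b = 0.5
    for c in (0.9, 0.7, 0.6, 0.51, 0.501, 0.5):
        print(c, weighted_moment(b, c))
```

As $c$ decreases to $b$ the moment grows without bound, matching the divergence at the boundary case $c = b$.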

4. Further problems 

Here is a partial list of open problems in connection with the statistics discussed
here and in Jager and Wellner (2004).

Question 1. What are the theorems corresponding to Theorem 3 in the case of
$R_n$ and $\tilde R_n$? In other words, for exactly which alternative distribution functions $F$
does it hold that

\[
R_n \to_{a.s.} \sup_x K(F(x), F_0(x)) = r(F,F_0)? \tag{4.15}
\]

For exactly which alternative distribution functions $F$ does it hold that

\[
\tilde R_n \to_{a.s.} \sup_x K(F_0(x), F(x)) = \tilde r(F,F_0)? \tag{4.16}
\]

Question 2. For alternative distribution functions $F$ such that (4.15) holds, can
we obtain useful approximations to the power of $R_n$ via limit theorems for

\[
\sqrt{n}\,(R_n - r(F,F_0))
\]

along the lines of Raghavachari (1973)? Similarly for $F$'s for which (4.16) holds
for $\tilde R_n$?

Question 3. Donoho and Jin (2004) consider testing $H_0 : F = N(0,1) = \Phi$ versus
$H_1 : F = (1-\epsilon)N(0,1) + \epsilon N(\mu,1)$, where $\epsilon = \epsilon_n = n^{-\beta}$ and $\mu = \mu_n = \sqrt{2r\log n}$ for
$\beta > 1/2$ and $r > 0$. They show that a natural "detection boundary" is given by

\[
r^*(\beta) =
\begin{cases}
\beta - 1/2, & 1/2 < \beta \le 3/4, \\
(1-\sqrt{1-\beta})^{2}, & 3/4 < \beta < 1 .
\end{cases}
\]

How do the statistics $R_n$, $\tilde R_n$, and $D_n(1/2)$ compare along the "detection boundary"
of Donoho and Jin (2004)? Note that Donoho and Jin (2004) find that $D_n(1/2)$ and
$R_n$ have quite comparable power behavior for their testing problem, but they show
that $D_n(1/2)$ has better power in the region $r > r^*(\beta)$ and $3/4 < \beta < 1$.

Question 4. What is the limiting null distribution of $\tilde R_n$?
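For readers who want to experiment with the detection boundary in Question 3, a minimal sketch of the function $r^*(\beta)$ follows. The function name is hypothetical; the two branches are exactly those displayed above, and they agree at $\beta = 3/4$, where both equal $1/4$.

```python
import math

def detection_boundary(beta):
    """Donoho-Jin detection boundary r*(beta) for 1/2 < beta < 1."""
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie in (1/2, 1)")
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2

if __name__ == "__main__":
    for beta in (0.55, 0.65, 0.75, 0.85, 0.95):
        print(beta, detection_boundary(beta))
```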
A theorem on compatibility of systems of
sets with applications

A. Goswami and B. V. Rao

Indian Statistical Institute 

Abstract: A general theorem on compatibility of two systems of subsets of a 
separable metric space is proved. This theorem is used to deduce results about 
points of continuity of functions, filtrations and operator semigroups among 
other things. 


In this paper we prove the following result which, in spirit, is very similar to
a classical 1908 result of W. H. Young [2, page 304] on real functions and unifies
several well-known results.

Theorem. Let $X$ be a separable metric space and $I$ be any uncountable subset of
the real line. Suppose that $\{A_t : t \in I\}$ and $\{B_t : t \in I\}$ are two families of subsets
of $X$ with each $A_t$ closed and satisfying the following condition:

(*) for each $t \in I$, there is a $\delta_t > 0$ such that $A_t \supset B_s$ whenever $s \in (t, t+\delta_t) \cap I$.

Then for all but countably many $t \in I$, $A_t \supset B_t$. The same conclusion holds if,
in condition (*), $(t, t+\delta_t)$ is replaced by $(t-\delta_t, t)$.

Proof. Set $I_\delta = \{t \in I : A_t \supset B_s$ whenever $s \in (t, t+\delta) \cap I\}$ and let $\rho$ denote the
metric on the space $X$. If the conclusion were false, then there is some $\delta > 0$ and an
uncountable set $S \subset I_\delta$ such that for all $t \in S$ the assertion fails. Since each $A_t$ is
closed, we can get $\epsilon > 0$ such that for uncountably many $t \in S$ there exists $x_t \in B_t$
such that $\rho(x_t, A_t) > \epsilon$. Cutting down $S$, if necessary, we can and shall assume
that this holds for all points $t$ in $S$. Again, there is no loss in assuming that $S$ is contained
in an interval of length smaller than $\delta$. Now if $t < t'$ are two distinct points of $S$,
then noticing that $t' \in (t, t+\delta)$, so that $x_{t'} \in B_{t'} \subset A_t$, we see that $\rho(x_t, x_{t'}) > \epsilon$.
Thus $\{x_t : t \in S\}$ is an uncountable set of elements of $X$ with any two of them
separated by distance larger than $\epsilon$, contradicting separability of $X$. The other part
is similarly proved. □

The following propositions illustrate some applications of the theorem; perhaps
there are others. In what follows, the closure of a set $A$ is denoted by $\overline{A}$.

Proposition 1. Let $X$ and $I$ be as above. Let $\{B_t : t \in I\}$ be any family of subsets
of $X$. Then for all but countably many $t \in I$,

\[
B_t \subset \bigcap_{\delta>0} \overline{\bigcup \{B_s : s \in (t, t+\delta) \cap I\}}
\]

and

\[
B_t \subset \bigcap_{\delta>0} \overline{\bigcup \{B_s : s \in (t-\delta, t) \cap I\}} .
\]

Proof. Fix $\delta > 0$ and put $A_t = \overline{\bigcup \{B_s : s \in (t, t+\delta) \cap I\}}$ for each $t \in I$. Then the
Theorem implies that $B_t \subset A_t$ for all but countably many $t \in I$. The proof is now
completed by running $\delta$ through a sequence decreasing to zero. The second part
follows similarly. □


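Proposition 2. Let $f$ be any function defined on an open interval $I$ into a separable
metric space $X$. For each $t \in I$, define

\[
L'_t = \bigcap_{\delta>0} \overline{f[(t-\delta, t) \cap I]}, \qquad
L_t = \bigcap_{\delta>0} \overline{f[(t-\delta, t] \cap I]}
\]

and

\[
R'_t = \bigcap_{\delta>0} \overline{f[(t, t+\delta) \cap I]}, \qquad
R_t = \bigcap_{\delta>0} \overline{f[[t, t+\delta) \cap I]} .
\]

Then for all but countably many $t \in I$, $L'_t = L_t = R'_t = R_t$.

Proof. Since by definition $L'_t \subset L_t$ and $R'_t \subset R_t$ for all $t \in I$, it suffices to show
that $L_t \subset R'_t$ and $R_t \subset L'_t$ for all but countably many $t \in I$. Fixing $\delta > 0$ and
putting, for each $t \in I$, $A_t = \overline{f[(t, t+\delta) \cap I]}$, $B_t = L_t$, it follows from the Theorem
that $B_t \subset A_t$ for all but countably many $t \in I$. Running $\delta$ through a sequence
decreasing to zero, one obtains $L_t \subset R'_t$ for all but countably many $t \in I$. The
other inclusion $R_t \subset L'_t$ is proved similarly. □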


Corollary (W. H. Young [2]). Let $f$ be any real-valued function defined on an
open interval $I$. For every $t \in I$, let

\[
\overline{f}(t-) = \limsup_{s \to t,\, s<t} f(s), \qquad \overline{f}(t+) = \limsup_{s \to t,\, s>t} f(s)
\]

and

\[
\underline{f}(t-) = \liminf_{s \to t,\, s<t} f(s), \qquad \underline{f}(t+) = \liminf_{s \to t,\, s>t} f(s) .
\]

Then for all but countably many $t \in I$, $\overline{f}(t-) = \overline{f}(t+)$ and $\underline{f}(t-) = \underline{f}(t+)$. In
particular, there is a countable set $D \subset I$ such that for $t \in I - D$, if one of the limits
$\lim_{s \to t,\, s<t} f(s)$ or $\lim_{s \to t,\, s>t} f(s)$ exists, then so does the other and the two are equal.

Proof. In view of the order preserving homeomorphism $x \to \arctan x$, it suffices to
consider bounded $f$ only. To complete the proof now, one has simply to observe
that $\overline{f}(t-) = \sup L'_t$, $\overline{f}(t+) = \sup R'_t$, $\underline{f}(t-) = \inf L'_t$ and $\underline{f}(t+) = \inf R'_t$ in the
notation of Proposition 2. □


Remark 1. It is possible to improve the above corollary as follows:

"For any function $f$ on an open interval $I$ into a separable metric space $X$,
there is a countable set $D \subset I$ such that, for $t \in I - D$, if either $\lim_{s \to t,\, s<t} f(s)$ or
$\lim_{s \to t,\, s>t} f(s)$ exists in $X$, then $f$ is continuous at $t$."

To see this, we first note that, by Proposition 2, there is a countable set $D \subset I$
such that $L'_t = L_t = R'_t = R_t$ for $t \in I - D$. For such a $t$, existence of either of
the limits stated in the proposition clearly implies that all these four sets $L'_t$, $L_t$,
$R'_t$, $R_t$ are equal to one singleton set, i.e., that $f$ is continuous at $t$.

As an immediate consequence of this, we get

Corollary. Any function $f : I \to X$ which has, at every point $t \in I$, either a left
limit or a right limit, can have at most countably many points of discontinuity.

A similar technique yields some results on differentiability properties of a real
function on an interval. Let $f$ be a real-valued function defined on an open interval
$I$. For $t \in I$, let

\[
D'_{t+} = \bigcap_{\delta>0} \overline{\Bigl\{ \frac{f(v)-f(u)}{v-u} : t < u < v < t+\delta \Bigr\}}
\]

and

\[
D'_{t-} = \bigcap_{\delta>0} \overline{\Bigl\{ \frac{f(v)-f(u)}{v-u} : t-\delta < u < v < t \Bigr\}} .
\]

$D_{t+}$ and $D_{t-}$ are defined analogously with the only exception that $t < u < v < t+\delta$
and $t-\delta < u < v < t$ are replaced by $t \le u < v < t+\delta$ and $t-\delta < u < v \le t$
respectively. It should be pointed out that the closures in the above definitions are
closures in the extended real line. Using arguments similar to that of Proposition 2
we get

Proposition 3. For all but countably many $t \in I$,

\[
D'_{t-} = D_{t-} = D_{t+} = D'_{t+} .
\]

From the definition of $D_{t-}$, it is clear that in case $D_{t-}$ is a singleton then $f$ must
have a left derivative at $t$. A similar argument applies for $D_{t+}$ as well. This easily
yields the following

Corollary. If $f : I \to R$ is such that for all but countably many $t$ in $I$, either $D_{t-}$
or $D_{t+}$ is a singleton, then $f$ is differentiable at all but countably many points.

A more satisfactory result would have been to replace the hypothesis in the
above corollary by the apparently weaker condition that $f$ has a left derivative or
a right derivative at all but countably many $t$. The main problem appears to be
that $f$ may have a left (right) derivative at a point $t$ without $D_{t-}$ ($D_{t+}$) being a
singleton. But can this happen at uncountably many points $t$? We do not know.

For the next few propositions, which are of interest in the context of stochastic
processes, we fix the following set-up and notations. $(\Omega, \mathcal{F}, P)$ denotes a probability
space where $\mathcal{F}$ is the $P$-completion of a countably generated $\sigma$-field. It is well known
that $\mathcal{F}$ is then a Polish space with the metric $\rho(A, B) = P(A \Delta B)$, provided one
identifies sets $A$ and $B$ in $\mathcal{F}$ whenever $P(A \Delta B) = 0$. For two sub-$\sigma$-fields $\mathcal{A}$ and $\mathcal{B}$
of $\mathcal{F}$, we say that $\mathcal{A}$ and $\mathcal{B}$ are equal up to $P$-null sets, and write $\mathcal{A} \sim \mathcal{B}$, to mean
that they generate the same $\sigma$-field on augmentation by $P$-null sets of $\mathcal{F}$. Note
that any sub-$\sigma$-field of $\mathcal{F}$, on augmentation by $P$-null sets, becomes a closed subset
(modulo the above identification) of the separable metric space $\mathcal{F}$. We will use the
same notation for a sub-$\sigma$-field of $\mathcal{F}$ as well as the closed subset it gives rise to. In
this language, $\mathcal{A} \sim \mathcal{B}$ simply means that $\mathcal{A}$ and $\mathcal{B}$ are equal as closed sets. Also, for
any family $\{\mathcal{F}_\alpha, \alpha \in A\}$ of $\sigma$-fields, the smallest $\sigma$-field containing $\mathcal{F}_\alpha$ for all $\alpha \in A$
will be denoted by $\bigvee_{\alpha \in A} \mathcal{F}_\alpha$.

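Proposition 4. Let $\{\mathcal{F}_t, t \in I\}$ be a monotone non-decreasing family of sub-$\sigma$-fields
of $\mathcal{F}$ where $I$ is an open interval. For each $t \in I$, let $\mathcal{F}_{t-} = \bigvee_{s<t} \mathcal{F}_s$ and
$\mathcal{F}_{t+} = \bigcap_{s>t} \mathcal{F}_s$. Then for all but countably many $t \in I$,

\[
\mathcal{F}_{t-} \sim \mathcal{F}_t \sim \mathcal{F}_{t+} .
\]

Proof. Take $A_t = \mathcal{F}_{t-}$ and $B_t = \mathcal{F}_{t+}$, and note that $B_s \subset A_t$ whenever $s < t$. From
the Theorem one gets $B_t \subset A_t$ for all but countably many $t \in I$. The proof is now
completed in view of the fact that $\mathcal{F}_{t-} \subset \mathcal{F}_t \subset \mathcal{F}_{t+}$ for $t \in I$. □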

As a consequence we have 

Proposition 5. If $\{X_t, t \ge 0\}$ is a stochastic process on $(\Omega, \mathcal{F}, P)$ and if, for
each $t \ge 0$, $\mathcal{F}_t = \sigma(X_u, 0 \le u \le t)$, then for all but countably many $t \ge 0$,
$\mathcal{F}_{t-} \sim \mathcal{F}_t \sim \mathcal{F}_{t+}$.

It is interesting to note that the exceptional set of $t$'s in the above proposition,
to be denoted by $D(X)$, depends only on the law of the process $\{X_t\}$; that is, for
two processes $\{X_t\}$ and $\{Y_t\}$ on $(\Omega, \mathcal{F}, P)$ having the same finite dimensional
distributions, $D(X) = D(Y)$. In particular, if $\{X_t, t \ge 0\}$ is a process with stationary
increments, then, for any $s > 0$, denoting the process $\{X_{t+s} - X_s, t \ge 0\}$ by $\{Y_t\}$,
one has $D(X) = D(Y)$. If, moreover, the increments of $X$ are independent, then one
can show, using the above, that the complement of $D(X)$ is a right interval and,
hence, has to contain $(0, \infty)$. The same argument can be used to show that $D(X)$
is actually empty. Thus we have

Proposition 6. If $\{X_t, t \ge 0\}$ is a process on $(\Omega, \mathcal{F}, P)$ with stationary, independent
increments, then for all $t$, $\mathcal{F}_{t-} \sim \mathcal{F}_t \sim \mathcal{F}_{t+}$.

This is what is usually known as Blumenthal's 0-1 law (see for example [3]),
for which the usual proof is via a right continuous modification of the process $\{X_t\}$
and the strong Markov property.

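Proposition 7. Let $I$ be any open interval and $\{\mathcal{G}_t, t \in I\}$ any family of sub-$\sigma$-fields
of $\mathcal{F}$. For each $t \in I$, define

\[
\mathcal{G}'_{t+} = \bigcap_{\delta>0} \bigvee \{\mathcal{G}_s, t < s < t+\delta\}, \qquad
\mathcal{G}_{t+} = \bigcap_{\delta>0} \bigvee \{\mathcal{G}_s, t \le s < t+\delta\},
\]

\[
\mathcal{G}'_{t-} = \bigcap_{\delta>0} \bigvee \{\mathcal{G}_s, t-\delta < s < t\}, \qquad
\mathcal{G}_{t-} = \bigcap_{\delta>0} \bigvee \{\mathcal{G}_s, t-\delta < s \le t\} .
\]

Then for all but countably many $t \in I$, $\mathcal{G}'_{t-} \sim \mathcal{G}_{t-} \sim \mathcal{G}'_{t+} \sim \mathcal{G}_{t+}$.

Proof. Fix $\delta > 0$ and take $A_t = \bigvee \{\mathcal{G}_s, t < s < t+\delta\}$, $B_t = \mathcal{G}_{t-}$. The Theorem
implies that, for all but countably many $t \in I$, $A_t \supset B_t$. By arguments similar to
those used in Proposition 2, one concludes that $\mathcal{G}_{t-} \subset \mathcal{G}'_{t+}$ for all but countably
many $t \in I$. Similarly one shows that $\mathcal{G}_{t+} \subset \mathcal{G}'_{t-}$ for all but countably many $t \in I$.
The proof is now complete in view of the inclusions $\mathcal{G}'_{t-} \subset \mathcal{G}_{t-}$ and $\mathcal{G}'_{t+} \subset \mathcal{G}_{t+}$. □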

In particular, this gives 




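Proposition 8 (V. S. Borkar [1]). For a stochastic process $\{X_t, t \ge 0\}$ on
$(\Omega, \mathcal{F}, P)$,

\[
\begin{aligned}
\bigcap_{\delta>0} \sigma(X_s, t-\delta < s < t) &\sim \bigcap_{\delta>0} \sigma(X_s, t-\delta < s \le t) \\
&\sim \bigcap_{\delta>0} \sigma(X_s, t < s < t+\delta) \sim \bigcap_{\delta>0} \sigma(X_s, t \le s < t+\delta)
\end{aligned}
\]

for all but countably many $t \ge 0$.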

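Remark 2. In an analogous manner, one gets:
For a stochastic process $\{X_t, t \ge 0\}$ on $(\Omega, \mathcal{F}, P)$,

\[
\begin{aligned}
\bigcap_{\delta>0} \sigma(X_u - X_s, t-\delta < s < u < t)
&\sim \bigcap_{\delta>0} \sigma(X_u - X_s, t-\delta < s < u \le t) \\
&\sim \bigcap_{\delta>0} \sigma(X_u - X_s, t < s < u < t+\delta) \\
&\sim \bigcap_{\delta>0} \sigma(X_u - X_s, t \le s < u < t+\delta)
\end{aligned}
\]

for all but countably many $t \ge 0$.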

We end this note with one more application which may have interesting
consequences for Markov processes.

Proposition 9. Let $\{T_t, t > 0\}$ be a semigroup of bounded linear operators on a
separable Banach space $B$ such that, for every $x \in B$, $\lim_{t \to 0+} T_t x$ exists in the
strong operator topology. Then $\{T_t, t > 0\}$ is strongly continuous. Moreover, the set
$\{x \in B : T_{0+} x = x\}$ is precisely the closed span of $\bigcup_{t>0} T_t B$.

Proof. By the uniform boundedness principle, the $T_t$ are uniformly bounded for $t$ in
any bounded interval. For any $x \in B$, the map $t \to T_t x$ has, by the corollary
following Remark 1, only countably many discontinuities. Separability of $B$ and
the boundedness property noted above permit us to choose one countable set of
$t$'s, outside of which the map $t \to T_t x$ is continuous for all $x \in B$. The semigroup
property, on the other hand, would assert that the continuity points form a right
interval. The proof is complete. □

Remark 3. Without separability of $X$ the Theorem fails. For example, put $I = R$,
$A_t = (t, \infty)$, and $B_t = [t, \infty)$ and let $X$ be the real line with discrete topology.
It is clear that the Theorem does not hold. However, in the non-separable case the
Theorem will remain true if we replace countably many by at most $\kappa$ many, where
$\kappa$ is the weight of $X$, that is, the least cardinality of a dense set in $X$. Interestingly,
for finite $X$, the exceptional set of $t$'s cannot have a right accumulation point of
order equalling card($X$).
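To see concretely why the example in Remark 3 defeats the Theorem, the following short check (a sketch that uses only the sets named in the remark) verifies that condition (*) holds while the conclusion fails at every point:

```latex
% X = \mathbb{R} with the discrete metric, I = \mathbb{R}
\begin{align*}
&A_t = (t,\infty) \text{ is closed, since every subset of a discrete space is closed;} \\
&s > t \;\Longrightarrow\; B_s = [s,\infty) \subset (t,\infty) = A_t,
  \text{ so (*) holds with any } \delta_t > 0; \\
&t \in B_t \setminus A_t \;\Longrightarrow\; A_t \not\supset B_t \text{ for every } t \in \mathbb{R}.
\end{align*}
```

The exceptional set is therefore all of $R$, which is uncountable.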
A question of geometry and probability

Richard A. Vitale

University of Connecticut

Abstract: We introduce the Aleksandrov-Fenchel inequality, apply it to a tail
bound for Gaussian processes, and speculate on a further connection.


1. Introduction 

Some time ago I brought a question involving geometry and probability to Herman
and that led us in an interesting direction [12]. To celebrate this occasion, I am
happy to bring another such question.

Recall the planar isoperimetric inequality, which says that for a convex body $K$
of area $A(K)$ and perimeter $L(K)$

\[
4\pi \cdot A(K) \le L^2(K) . \tag{1}
\]

Consider now a $2 \times 2$ matrix $M$ of independent $N(0,1)$ variables and the image
body $MK$. Inserting into (1) and taking expectations gives

\[
4\pi \cdot E[A(MK)] \le E[L^2(MK)] . \tag{2}
\]

However, it is the case that the following stronger inequality holds:

\[
4\pi \cdot E[A(MK)] \le [E L(MK)]^2 . \tag{3}
\]

It is possible to verify (3) as a simple exercise in Gaussian determinants, but one
cannot say that this approach gives a satisfying explanation of what is really going
on, for example, why (2) and (3) differ precisely by $\mathrm{Var}\, L(MK)$.

In fact, a deep theory is in the background, and the question of the title is to
ask how it can be systematically exploited in this and other stochastic contexts. In
the next sections, we briefly outline the theory and then turn to a specific question
connected with Gaussian processes.

2. The Aleksandrov-Fenchel inequality 

The bound (3) can be regarded as the first in an infinite sequence of inequalities,
each of which is a stochastic formulation of the Aleksandrov-Fenchel (A-F)
inequality in convex geometry. The A-F inequality is well-known to specialists as
a powerful tool, having as implications the isoperimetric inequality (in all
dimensions) and the Brunn-Minkowski inequality [8]. It has been successfully applied to
problems in combinatorics as well as to the resolution of the van der Waerden
permanent conjecture [6, 7, 14, 15, 16]. Interestingly, the original plan for the classic
compilation [4] was to have a sequel based entirely on the A-F inequality. A closely
related inequality on mixed discriminants [1, 2, 3] has found applications in
stochastic settings. In view of this background, it is surprising that the A-F inequality itself
has not found more applications in stochastic settings. One exception is questions
in the theory of Gaussian processes, to which we turn in the next section.

A quick introduction to the A-F inequality goes as follows. It is part of
Brunn-Minkowski Theory [13], which deals with the interaction between volume evaluation
and vector addition of convex bodies (i.e., compact, convex subsets). For convex
bodies $K_1, K_2, \ldots, K_n$ in $R^d$ and positive coefficients $\lambda_1, \lambda_2, \ldots, \lambda_n$,

\[
\mathrm{vol}(\lambda_1 K_1 + \lambda_2 K_2 + \cdots + \lambda_n K_n)
 = \sum_{i_1, i_2, \ldots, i_d = 1}^{n} \lambda_{i_1} \lambda_{i_2} \cdots \lambda_{i_d}\, V(K_{i_1}, K_{i_2}, \ldots, K_{i_d}), \tag{4}
\]

where, without loss of generality, the coefficients $V(\cdot)$ are taken to be symmetric in
their arguments. The A-F inequality then asserts the following:

Theorem 1. For convex bodies $K_1, K_2, \ldots, K_d$ in $R^d$,

\[
V^2(K_1, K_2, K_3, \ldots, K_d) \ge V(K_1, K_1, K_3, \ldots, K_d)\, V(K_2, K_2, K_3, \ldots, K_d) . \tag{5}
\]

For the special case of a parallel body $K + \lambda B$ ($B$, the unit ball in $R^d$), (4)
simplifies to the Steiner formula [17]

\[
\mathrm{vol}(K + \lambda B) = \sum_{i=0}^{d} \lambda^{i} \omega_i V_{d-i}(K), \tag{6}
\]

where $\omega_i$ is the volume of the unit ball in $R^i$ and the $V_i(K)$ are the intrinsic volumes
of $K$ ($V_0 = 1$). Then (5) translates to the sequence $\{i!\, V_i(K)\}_{i=0}^{d}$ being log-concave:

\[
(i!\, V_i(K))^2 \ge (i-1)!\, V_{i-1}(K) \cdot (i+1)!\, V_{i+1}(K), \qquad i = 1, 2, \ldots, d-1 \tag{7}
\]

(elsewhere this property has been called ultra-logconcavity of order $\infty$ [10, 11]).
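A small numerical illustration of (6) and (7) may help. The sketch below is not from the paper: it specializes to a rectangular box, whose intrinsic volumes are the elementary symmetric polynomials of its side lengths (a standard fact assumed here), and all function names are hypothetical.

```python
import math

def intrinsic_volumes_box(sides):
    """V_0, ..., V_d of a box: elementary symmetric polynomials of the side lengths."""
    v = [1.0]
    for a in sides:
        # e_k(new) = e_k(old) + a * e_{k-1}(old)
        v = [1.0] + [v[i] + a * v[i - 1] for i in range(1, len(v))] + [a * v[-1]]
    return v

def unit_ball_volume(i):
    """omega_i, the volume of the unit ball in R^i."""
    return math.pi ** (i / 2.0) / math.gamma(i / 2.0 + 1.0)

def steiner_volume(sides, lam):
    """vol(K + lam * B) for a box K, evaluated through the Steiner formula (6)."""
    d = len(sides)
    v = intrinsic_volumes_box(sides)
    return sum(lam ** i * unit_ball_volume(i) * v[d - i] for i in range(d + 1))

def factorial_sequence_is_log_concave(v):
    """Check inequality (7) for the sequence i! * V_i."""
    a = [math.factorial(i) * vi for i, vi in enumerate(v)]
    return all(a[i] ** 2 >= a[i - 1] * a[i + 1] for i in range(1, len(a) - 1))

if __name__ == "__main__":
    # 2 x 3 rectangle enlarged by radius 1: area 6 + perimeter 10 + pi
    print(steiner_volume([2.0, 3.0], 1.0), 16.0 + math.pi)
    print(factorial_sequence_is_log_concave(intrinsic_volumes_box([1.0, 2.0, 3.0])))
```

For the planar rectangle the Steiner polynomial reduces to area plus perimeter times $\lambda$ plus $\pi\lambda^2$, which is an easy independent check of the normalization $V_1 = $ half the perimeter.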

3. Intrinsic volumes and Gaussian processes 

The theory of Gaussian processes has been heavily influenced by convex geometry
[5, 9, 18, 22, 23]. Here we draw especially on [19, 20, 21, 22].

A popular approach to Gaussian processes is canonical indexing: suppose that
$A \subset R^d$ and that $Z = (Z_1, Z_2, \ldots, Z_d)$ are iid $N(0,1)$ variables. A canonically
indexed Gaussian process $X_A = \{X_t, t \in A\}$ has the form $X_t = \sum_i t_i Z_i = \langle t, Z \rangle$
(this process evidently has "rank" no greater than $d$, but similar definitions can
be made in Hilbert space for more general processes). If $A = K$, a convex body,
then intrinsic volumes come into play. For $j = 1, 2, \ldots, d$ define the vector process
$X_t^{j} = (X_t^{(1)}, \ldots, X_t^{(j)})$, where the components are independent copies of $X_t$.
Further, define the (random) convex body $X_K^{j} = \mathrm{conv}\{X_t^{j}, t \in K\} \subset R^j$. Then

\[
V_j(K) = \frac{(2\pi)^{j/2}}{j!\, \omega_j}\, E\, \mathrm{vol}_j(X_K^{j}) . \tag{8}
\]

The Wills functional is given by

\[
W(K) = E \exp\Bigl[ \sup_{t \in K} \bigl( X_t - \tfrac{1}{2} E X_t^2 \bigr) \Bigr] \tag{9}
\]

and has the generating function expansion

\[
W(rK) = \sum_{j=0}^{\infty} V_j(K) \Bigl( \frac{r}{\sqrt{2\pi}} \Bigr)^{j} . \tag{10}
\]

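An important consideration for Gaussian processes is "size," which is
traditionally interpreted as $\sup_t X_t$. Tail probability bounds are of various types, and
we illustrate the application of the preceding ideas to a sharpening of a bound
in [22]. Fix $K$, and recall that the A-F inequality implies that $a_j = j!\, V_j(K)$ is a
log-concave sequence:

\[
\log a_j \le \log a_i + (\log a_{i+1} - \log a_i)(j - i),
\]

which implies

\[
V_j(K) \le \frac{i!\, V_i(K)}{j!} \Bigl[ \frac{(i+1) V_{i+1}(K)}{V_i(K)} \Bigr]^{j-i} .
\]

Substituting into (10) and summing $j = 0, \ldots, \infty$ yields

\[
W(rK) \le i!\, V_i(K) \Bigl[ \frac{V_i(K)}{(i+1) V_{i+1}(K)} \Bigr]^{i} \exp\Bigl[ \frac{(i+1) V_{i+1}(K)\, r}{\sqrt{2\pi}\, V_i(K)} \Bigr]
\]

or, relabeling $i+1$ as $i$,

\[
W(rK) \le \frac{i!\, V_i(K)}{(2\pi)^{i/2} m_i^{i}(K)} \exp[ m_i(K)\, r ],
\]

where

\[
m_i(K) = \frac{i\, V_i(K)}{\sqrt{2\pi}\, V_{i-1}(K)} . \tag{11}
\]

A straightforward application of Markov's inequality then provides the bound

\[
P\Bigl( \sup_{t \in K} X_t \ge a \Bigr) \le \inf_{i} \Bigl\{ \frac{i!\, V_i(K)}{(2\pi)^{i/2} m_i^{i}(K)} \exp\Bigl[ -\frac{(m_i(K) - a)^2}{2\sigma^2} \Bigr] \Bigr\},
\]

where $a > 0$ and $\sigma^2 = \sup_{t \in K} E X_t^2 = \sup_{t \in K} \|t\|^2$.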

This brings us to the issue mentioned in the introduction: the way in which
the values $m_i(K)$ have arisen suggests that they may be natural parameters of
the process for other questions as well. It is easy to verify that $m_1(K)$ is at once
$E \sup_{t \in K} X_t$ and proportional to the mean width of $K$ and, as such, has linear
dimension. The succeeding $m_i$ also have linear units, and evidently provide alternate
size measures for both $K$ and $\{X_t, t \in K\}$. Their asymptotic behavior reflects the
regularity of the process (see [24] for details), but it seems clear that their specific
values must also calibrate successive $i$-th order properties of some type for the
process. What these are remains for investigation.
Generalized Accept-Reject sampling
schemes

George Casella, Christian P. Robert, and Martin T. Wells

University of Florida, Université Paris 9 - Dauphine and Cornell University

Abstract: This paper extends the Accept-Reject algorithm to allow the
proposal distribution to change at each iteration. We first establish a necessary
and sufficient condition for this generalized Accept-Reject algorithm to be
valid, and then show how the resulting estimator can be improved by
Rao-Blackwellization. An application of these results is to the perfect sampling
technique of Fill (1998), which is a generalized Accept-Reject algorithm.

1. Preface by GC 

This paper is especially appropriate for a volume dedicated to Herman Rubin, as 
he was the first person who ever mentioned the Accept-Reject algorithm to me 
although, at the time, I didn’t understand a word that he was talking about. I was 
a graduate student at Purdue in the mid-70s, and Herman was always working on 
some problem, and if he saw you in the hall he would tell you about it. One day 
he told me he was working on an algorithm that generated "test exponentials" to
get normal random variables. I had no idea why anyone would want to do such a
thing (remember the 70s? - we were proving theorems!). Herman eventually wrote
a technical report, but I don't think I ever read it and don't know if it ever was
published. And then Herman got interested in other things. But when I think of
this story I often wonder how much further along Monte Carlo methods would be
today if Herman kept his interest in those "test exponentials"!

2. Introduction 

Accept-Reject algorithms are based on the use of a proposal distribution $g$ which
serves to simulate from a given target density $f$, when the ratio $f/g$ is bounded by
$1/\epsilon$, say. The standard Accept-Reject Algorithm is

Algorithm A1: Accept-Reject.

At iteration $i$ ($i \ge 1$)

1. Generate $X_i \sim g$ and $U_i \sim U([0,1])$, independently.

2. If $U_i \le \epsilon f(X_i)/g(X_i)$, accept $X_i \sim f$;

3. otherwise, move to iteration $i + 1$.
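The three steps of Algorithm A1 can be sketched as follows. This is an illustration only: the target (a Beta(2,2) density), the uniform proposal, and the function names are our assumptions, chosen so that $f/g \le 3/2$ and hence $\epsilon = 2/3$.

```python
import random

def accept_reject(f, g_density, g_sample, eps, rng):
    """One draw from f by Algorithm A1; assumes eps * f(x) <= g_density(x) for all x."""
    while True:
        x = g_sample(rng)                      # step 1: X_i ~ g
        u = rng.random()                       #         U_i ~ U([0, 1])
        if u <= eps * f(x) / g_density(x):     # step 2: accept
            return x
        # step 3: otherwise move to the next iteration

def beta22(x):
    """Illustrative target: Beta(2, 2) density on [0, 1], bounded above by 3/2."""
    return 6.0 * x * (1.0 - x)

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [accept_reject(beta22, lambda x: 1.0, lambda r: r.random(), 2.0 / 3.0, rng)
             for _ in range(5000)]
    print(sum(draws) / len(draws))  # should be near the Beta(2, 2) mean 1/2
```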
Since the inequality is not always satisfied, the algorithm generates pairs $(X_i, U_i)$
that are rejected. These pairs can be recycled in many ways, including the
Rao-Blackwellizing approach by Casella and Robert (1996), which replaces the
standard estimator $\delta$ based on the accepted pairs with the conditional expectation
$E[\delta \mid x_1, \ldots, x_n, n]$, which integrates out the uniform variables.

We give in this note a necessary and sufficient condition for a generalized
Accept-Reject algorithm to be valid and show that Rao-Blackwellization also
applies here, allowing the use of the rejected samples to produce an improved
estimator.

This work was partially motivated by that of Fill (1998), who developed an
interruptible perfect sampling algorithm as an alternative to Propp and Wilson's
(1996) coupling from the past technique. Perfect sampling results in iid outputs
from the stationary distribution of the MCMC Markov chain (see Dimakos (2001),
Robert and Casella (1999) or Casella, Lavine and Robert (2000) for introductions
to perfect sampling). At the core of Fill's algorithm, described in Section 5, is an
Accept-Reject algorithm with the feature that the proposal distribution can be
modified at each step.

The possibility of changing the proposal distribution at each failure/rejection 
implies that his method does not fall in the category of a standard Accept-Reject 
algorithm. It is this more general Accept-Reject algorithm that we are interested 
in. 


3. A generalized Accept-Reject algorithm

We consider the following extension to the standard Accept-Reject algorithm:

Algorithm A2: Generalized Accept-Reject.

At iteration $i$ ($i \ge 1$)

1. Generate $X_i \sim g_i$ and $U_i \sim U([0,1])$, independently.

2. If $U_i \le \epsilon_i f(X_i)/g_i(X_i)$, accept $X_i \sim f$;

3. otherwise, move to iteration $i + 1$.

Thus, at each iteration $i$ ($0 < i < \infty$), the algorithm uses a different pair $(g_i, \epsilon_i)$
such that $\epsilon_i f(x)/g_i(x) \le 1$, uniformly in $x$. Each of these pairs is thus acceptable
for the original Accept-Reject scheme. However, the proposal distribution keeps
changing at each reject iteration and may be more adaptive than the single
Accept-Reject proposal distribution, or even the adaptive rejection algorithm of Gilks and
Wild (1992), which uses an envelope on the target density.
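Algorithm A2 can be sketched in the same way. Everything specific below is an assumption made for illustration: the Beta(2,2) target, the two alternating proposals (uniform with $\epsilon = 2/3$, and the density $2x$ with $\epsilon = 1/3$), the iteration cap, and the function names.

```python
import math
import random

def generalized_accept_reject(f, proposal, rng, max_iter=10000):
    """One draw by Algorithm A2; proposal(i) returns (sample, density, eps) for iteration i."""
    for i in range(1, max_iter + 1):
        sample, density, eps = proposal(i)
        x = sample(rng)                              # X_i ~ g_i
        if rng.random() <= eps * f(x) / density(x):  # U_i <= eps_i f(X_i) / g_i(X_i)
            return x
    raise RuntimeError("no acceptance within max_iter iterations")

def beta22(x):
    """Illustrative target: Beta(2, 2) density on [0, 1]."""
    return 6.0 * x * (1.0 - x)

def alternating_proposal(i):
    """Hypothetical schedule: uniform on odd iterations, density 2x on even ones."""
    if i % 2 == 1:
        return (lambda r: r.random(), lambda x: 1.0, 2.0 / 3.0)
    # sqrt(1 - U) has density 2x on (0, 1]; f(x) / (2x) = 3(1 - x) <= 3, so eps = 1/3
    return (lambda r: math.sqrt(1.0 - r.random()), lambda x: 2.0 * x, 1.0 / 3.0)

if __name__ == "__main__":
    rng = random.Random(1)
    draws = [generalized_accept_reject(beta22, alternating_proposal, rng) for _ in range(5000)]
    print(sum(draws) / len(draws))  # should again be near 1/2
```

Each call restarts at iteration 1, so the schedule of proposals is the same for every accepted value.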

If the proposal distribution is parameterized by a parameter $\theta$, we can select a
pre-determined sequence of values of $\theta$ to monitor the performance in simulating
the distribution of interest $f$. The value of $\theta$ at the time of acceptance can then be
exploited in further simulations without jeopardizing the independence properties
of the algorithm.

The extension of the Accept-Reject Algorithm does not hold in full generality, in
the sense that the distribution of the accepted random variable may not necessarily
be the correct one. A minimum requirement must be imposed on the sequence of
the $\epsilon_i$'s (and hence on the $g_i$'s).
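If we denote by $Z$ the (possibly defective) random variable that is output by
Algorithm A2, $Z$ has the cdf (for simplicity, in the univariate continuous case):

\[
\begin{aligned}
P(Z \le z) &= \sum_{i=1}^{\infty} P(Z \le z, Z = X_i) \\
&= \sum_{i=1}^{\infty} P\bigl(X_i \le z,\ U_i \le f(X_i)\epsilon_i/g_i(X_i)\bigr) \prod_{j=1}^{i-1} P\bigl(U_j > f(X_j)\epsilon_j/g_j(X_j)\bigr) \\
&= \sum_{i=1}^{\infty} \int_{-\infty}^{z} \frac{\epsilon_i f(x)}{g_i(x)}\, g_i(x)\,dx \prod_{j=1}^{i-1} (1-\epsilon_j)
 = \int_{-\infty}^{z} f(x)\,dx \sum_{i=1}^{\infty} \epsilon_i \prod_{j=1}^{i-1} (1-\epsilon_j) .
\end{aligned}
\]

Therefore, the output is distributed from $f$ if $\sum_{i=1}^{\infty} \epsilon_i \prod_{j=1}^{i-1} (1-\epsilon_j) = 1$. The
following theorem ties this condition to the divergence of an associated series.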

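Theorem 3.1. The Generalized Accept-Reject Algorithm is valid if, and only if,
the series $\sum \log(1-\epsilon_i)$ diverges, since

\[
\sum_{i=1}^{\infty} \epsilon_i \prod_{j=1}^{i-1} (1-\epsilon_j) = 1 \quad \text{if and only if} \quad \sum_{i=1}^{\infty} \log(1-\epsilon_i) \text{ diverges.} \tag{1}
\]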
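The criterion can be explored numerically. The sketch below (the function name and the two example sequences are ours) accumulates the partial sums $\sum_{i \le n} \epsilon_i \prod_{j<i}(1-\epsilon_j)$, which telescope to $1 - \prod_{j \le n}(1-\epsilon_j)$. With $\epsilon_i = 1/(i+1)$ the series of logarithms diverges and the mass tends to 1; with $\epsilon_i = 1/(i+1)^2$ it converges and the mass tends to $1/2$, so the output is defective.

```python
def acceptance_mass(eps_seq):
    """Sum of eps_i * prod_{j<i}(1 - eps_j): probability of acceptance within len(eps_seq) steps."""
    mass, survive = 0.0, 1.0
    for e in eps_seq:
        mass += e * survive      # accepted exactly at this iteration
        survive *= 1.0 - e       # still rejected after this iteration
    return mass

if __name__ == "__main__":
    n = 1000
    slow = [1.0 / (i + 1) for i in range(1, n + 1)]        # sum of logs diverges
    fast = [1.0 / (i + 1) ** 2 for i in range(1, n + 1)]   # sum of logs converges
    print(acceptance_mass(slow), acceptance_mass(fast))
```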

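Proof. Note first that $\xi_n = \sum_{i=1}^{n} \epsilon_i \prod_{j=1}^{i-1} (1-\epsilon_j)$ necessarily converges to a limit less than,
or equal to, 1 since

(a) for every $n \ge 1$,

\[
\begin{aligned}
\xi_n &= \sum_{i=1}^{n} \epsilon_i \prod_{j=1}^{i-1} (1-\epsilon_j) \\
&= \epsilon_1 + (1-\epsilon_1)\{\epsilon_2 + (1-\epsilon_2)[\ldots (1-\epsilon_{n-1})\epsilon_n \ldots]\} \\
&\le \epsilon_1 + (1-\epsilon_1)\{\epsilon_2 + (1-\epsilon_2)[\ldots \epsilon_{n-1} + (1-\epsilon_{n-1}) \ldots]\} \\
&= 1 .
\end{aligned}
\]

(b) the sequence is increasing with $n$.

Now, $\{\xi_n\}$ converges to 1 if, and only if, for every $0 < \eta < 1$, there exists $n_0$ such
that

\[
\xi_n > 1 - \eta \quad \text{for } n > n_0 . \tag{2}
\]

The condition (2) is equivalent to, for $n > n_0$,

\[
\begin{aligned}
& \epsilon_1 + (1-\epsilon_1)\{\epsilon_2 + (1-\epsilon_2)[\ldots (1-\epsilon_{n-1})\epsilon_n \ldots]\} > 1 - \eta \\
\Longleftrightarrow\ & \epsilon_2 + (1-\epsilon_2)\{\epsilon_3 + \ldots (1-\epsilon_{n-1})\epsilon_n \ldots\} > \frac{1-\epsilon_1-\eta}{1-\epsilon_1} = 1 - \frac{\eta}{1-\epsilon_1} \\
\Longleftrightarrow\ & \ldots \\
\Longleftrightarrow\ & \epsilon_n > 1 - \frac{\eta}{\prod_{i=1}^{n-1} (1-\epsilon_i)} .
\end{aligned} \tag{3}
\]

The sequence $\omega_n = \prod_{i=1}^{n-1} (1-\epsilon_i)$, $\omega_1 = 1$, is decreasing and nonnegative. Thus,
it either converges to 0 or to $\alpha > 0$. If it converges to 0, that is, if $\sum \log(1-\epsilon_i)$
diverges, the ratio $\eta/\omega_n$ goes to $+\infty$ with $n$ and the right hand side in (3) is negative
for $n$ large enough, which ensures that (2) holds.

If $\{\omega_n\}$ converges to $\alpha > 0$, the series $\sum \log(1-\epsilon_i)$ converges and $\log(1-\epsilon_n)$
goes to 0 as $n$ goes to infinity by Cauchy's criterion. Thus, $\{\epsilon_n\}$ converges to 0.
Therefore, for $\delta$ small enough, there exists $n_1$ such that $\epsilon_n < \delta$ for $n > n_1$. If
one chooses $\eta$ such that $1 - \eta/\alpha = \delta$ and if (2) holds, one gets $\epsilon_n < \delta < \epsilon_n$ for
$n > \max(n_0, n_1)$, which is impossible. □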

This result has several implications. First, it shows that continued modifications
of the proposal distribution in the Accept-Reject algorithm are legitimate as long
as the acceptance rate $\epsilon_n$ does not converge to zero too fast. Second, the acceptance
rate does not have to go to 1 with $n$, so some $\epsilon_n$'s (even an infinity of them)
may be equal to 0, and the algorithm remains valid. Note, however, that if one $\epsilon_n$
is equal to 1, the sequence terminates.

Theorem 3.1 applies to and validates the generalized Accept-Reject algorithm
not only when $\epsilon_n$ is constant, but also when the $\epsilon_n$'s are periodic in $n$, and when
the sequence $\{\epsilon_n\}$ is uniformly bounded away from 0.
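The condition of Theorem 3.1 can be checked numerically on finite prefixes. The sketch below is only an illustration: the function name `accepted_mass` and the two example sequences are assumptions, not part of the original text. It relies on the telescoping identity $\sum_{i=1}^{n} \epsilon_i \prod_{j<i}(1-\epsilon_j) = 1 - \prod_{i=1}^{n}(1-\epsilon_i)$, so the accumulated acceptance mass tends to 1 exactly when the product tends to 0.

```python
def accepted_mass(eps):
    """Return sum_i eps[i] * prod_{j<i} (1 - eps[j]) over a finite prefix.

    This is the probability that the scheme has accepted a value by the
    end of the prefix; validity requires it to tend to 1.
    """
    mass, still_running = 0.0, 1.0
    for e in eps:
        mass += e * still_running   # accept at this iteration
        still_running *= 1.0 - e    # or reject and move to the next one
    return mass


n = 10_000
# Hypothetical acceptance rates: 1/(i+1) decays slowly enough that the
# series of log(1 - eps_i) diverges; 1/(i+1)^2 decays too fast.
slow_decay = [1.0 / (i + 1) for i in range(1, n + 1)]
fast_decay = [1.0 / (i + 1) ** 2 for i in range(1, n + 1)]

mass_slow = accepted_mass(slow_decay)   # equals 1 - 1/(n+1) by telescoping
mass_fast = accepted_mass(fast_decay)   # equals 1 - (n+2)/(2(n+1)), about 1/2
print(mass_slow, mass_fast)
```

With the slowly decaying rates the mass approaches 1, while with the rapidly decaying rates it stalls near 1/2, so half of the probability is never assigned to an accepted value.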


4. Rao-Blackwellization 

The output from the generalized Accept-Reject algorithm is as follows: A sequence
$Y_1, Y_2, \ldots$ of independent random variables is generated from the $g_i$'s along with a
corresponding sequence $U_1, U_2, \ldots$ of uniform random variables. We show how to
extend the results of Casella and Robert (1996) to this more general algorithm.

Given a function $h$, the Accept-Reject estimator of $E_f\{h(X)\}$, based upon a
sample $X_1, \ldots, X_t$, with $t$ fixed, is made of the $t$ accepted values among the $Y_j$'s
and is given by

\[
\tau_1 = \frac{1}{t} \sum_{i=1}^{t} h(X_i) = \frac{1}{t} \sum_{i=1}^{N} I(U_i \le w_i)\, h(Y_i), \tag{4}
\]

where $N$, the number of $Y_j$'s generated, is a random integer satisfying

\[
\sum_{i=1}^{N} I(U_i \le w_i) = t \quad \text{and} \quad \sum_{i=1}^{N-1} I(U_i \le w_i) = t - 1,
\]

with $w_i = f(Y_i)\epsilon_i/g_i(Y_i)$. By the Rao-Blackwell Theorem, the conditional
expectation

\[
\tau_2 = \frac{1}{t}\, E\Bigl[\, \sum_{i=1}^{N} I(U_i \le w_i)\, h(Y_i) \,\Bigm|\, N, Y_1, \ldots, Y_N \Bigr] \tag{5}
\]

improves upon (4).

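The joint distribution of $(N, Y_1, \ldots, Y_N, U_1, \ldots, U_N)$ is given by

\[
\begin{aligned}
& P(N = n,\, Y_1 \le y_1, \ldots, Y_n \le y_n,\, U_1 \le u_1, \ldots, U_n \le u_n) \\
&\quad = \int_{-\infty}^{y_n} g_n(v_n)(u_n \wedge w_n)\,dv_n
   \int_{-\infty}^{y_1} \cdots \int_{-\infty}^{y_{n-1}} g_1(v_1) \cdots g_{n-1}(v_{n-1}) \\
&\qquad \times \sum_{(i_1,\ldots,i_{t-1})} \prod_{j=1}^{t-1} (w_{i_j} \wedge u_{i_j})
   \prod_{j=t}^{n-1} (u_{i_j} - w_{i_j})^{+} \, dv_1 \cdots dv_{n-1},
\end{aligned}
\]

where $w = \epsilon f(v)/g(v)$ (with appropriate subscripts) and the last sum is over all
subsets $(i_1, \ldots, i_{t-1})$ of $\{1, \ldots, n-1\}$ of size $t-1$, with $(i_t, \ldots, i_{n-1})$ denoting
the complementary indices. Therefore, the conditional density of the $U_i$'s
is given by

\[
f(u_1, \ldots, u_n \mid N = n, y_1, \ldots, y_n)
= \frac{\sum_{(i_1,\ldots,i_{t-1})} \prod_{j=1}^{t-1} I(u_{i_j} \le w_{i_j}) \prod_{j=t}^{n-1} I(u_{i_j} > w_{i_j})}
       {\sum_{(i_1,\ldots,i_{t-1})} \prod_{j=1}^{t-1} w_{i_j} \prod_{j=t}^{n-1} (1 - w_{i_j})}
  \; \frac{I(u_n \le w_n)}{w_n},
\]

where, analogously, $w = \epsilon f(y)/g(y)$. Using this distribution we can calculate,
conditional on $(N, y_1, \ldots, y_N)$, the probability $\rho_i$ of the events $\{U_i \le w_i\}$ and thus
derive the weights of $h(Y_i)$ in the estimator $\tau_2$. The calculations involve averaging
over permutations of the realized sample and yield, for $i < n$,

\[
\rho_i = w_i \,
\frac{\sum_{(i_1,\ldots,i_{t-2})} \prod_{j=1}^{t-2} w_{i_j} \prod_{j=t-1}^{n-2} (1 - w_{i_j})}
     {\sum_{(i_1,\ldots,i_{t-1})} \prod_{j=1}^{t-1} w_{i_j} \prod_{j=t}^{n-1} (1 - w_{i_j})}, \tag{6}
\]

while $\rho_n = 1$. The numerator sum is over all subsets of $\{1, \ldots, i-1, i+1, \ldots, n-1\}$
of size $t-2$, and the denominator sum is over all subsets of size $t-1$. The following
result therefore holds.


Theorem 4.1. For $N = n$, the Rao-Blackwellized version of (4) is given by

\[
\tau_2 = \frac{1}{t} \sum_{i=1}^{n} \rho_i\, h(Y_i),
\]

where $\rho_i$ is given by equation (6).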
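For small samples, the weights of equation (6) can be evaluated by brute-force enumeration of subsets. The sketch below is a minimal illustration under stated assumptions: the function names and the example values of the $w_i$'s are hypothetical, and the enumeration is exponential in $n$, so it is meant only to make the formula concrete, not to be an efficient implementation.

```python
from itertools import combinations
from math import prod


def subset_weight_sum(w, size):
    """Sum, over all index subsets A with |A| = size, of
    prod_{i in A} w[i] * prod_{i not in A} (1 - w[i])."""
    indices = range(len(w))
    total = 0.0
    for accepted in combinations(indices, size):
        accepted = set(accepted)
        total += prod(w[i] if i in accepted else 1.0 - w[i] for i in indices)
    return total


def rao_blackwell_weights(w, t):
    """Weights of h(Y_i) given N = n = len(w) and t acceptances (n >= t >= 1).

    The last proposal is accepted by construction, so its weight is 1.
    """
    n = len(w)
    first = list(w[: n - 1])
    denom = subset_weight_sum(first, t - 1)
    rho = []
    for i in range(n - 1):
        others = first[:i] + first[i + 1:]
        # With t = 1 no earlier proposal can have been accepted.
        numer = w[i] * subset_weight_sum(others, t - 2) if t >= 2 else 0.0
        rho.append(numer / denom)
    rho.append(1.0)
    return rho


# Hypothetical acceptance ratios w_i for n = 5 proposals and t = 3 acceptances.
rho = rao_blackwell_weights([0.2, 0.5, 0.9, 0.4, 0.7], t=3)
print(rho, sum(rho))
```

Two sanity checks follow from the construction: the weights always sum to $t$, and when all $w_i$ are equal each of the first $n-1$ weights is $(t-1)/(n-1)$ by symmetry.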


5. Perfect sampling 

A perfect sampling algorithm for a Markov chain is an algorithm that produces a
random variable that is exactly distributed according to the stationary distribution
of the Markov chain using variables that are (typically) generated from the
conditional distributions of the chain. Perfect sampling in Markov chains originated with
the ingenious "coupling from the past" algorithm of Propp and Wilson (1996). In
practice, however, this algorithm has some drawbacks, such as not being
interruptible and thus creating biases in the output in cases of interruption
for insufficient memory and such.

An alternative, interruptible, perfect sampling algorithm was proposed by Fill
(1998). Since it is interruptible, Fill's perfect sampling algorithm seems to be
somewhat more practical than coupling from the past, although it requires delicate
reversibility and coupling arrangements as shown below.

Fill’s algorithm (see also Fill et al. 1999) can be described as follows: 

(a) Starting at an arbitrary state 0, run a finite state Markov chain $(X_j)$ for $t$
(fixed) steps, and record $X_t = x$.

(b) Starting Markov chains at all possible states at time $t$, run them in reversed
time, coupled with the original chain.

(c) If all these chains have coalesced, that is, if they all are in state 0 at time 0,
then accept $X_t = x$ as an observation from the stationary distribution. If not,
reject $X_t$ and start again, possibly with different values of 0 and of $t$.

We now relate the results of the previous sections to Fill's algorithm.

• The surprising feature of this method is that it is a rejection algorithm with
the clever twist that the probability of acceptance is exactly the probability
of coalescence. This circumvents the problem of calculating this acceptance
probability, which is typically not feasible.

• Fill's (1998) algorithm depends on a parameter $t$, which is the number of
forward steps in the Markov chain and which can be modified at each iteration,
by, for instance, doubling the value of $t$ in a typical CFTP manner. Thus,
the proposal distribution is changing at every iteration, and the algorithm is
covered by Theorem 3.1 (but is not covered by the standard Accept-Reject
algorithm).

• For Theorem 3.1 to validate Fill's (1998) algorithm, the series $\sum_i \log(1-\epsilon_i)$
built from the acceptance probabilities $\epsilon_i$ must diverge. The difficulty then lies in
establishing this without the $\epsilon_i$'s being available, which is the essence of Fill's technique.
However, if the selection is periodic, Fill's algorithm is indeed valid, provided
some $\epsilon_i$'s are different from 0. In fact, in most practical cases Fill's algorithm
will have an increasing acceptance rate, so will be covered by Theorem 3.1.

• The application of Theorem 4.1 to Fill's algorithm requires some further work
since, in that case, the weights $w_i = f(x_i)/K^t(0, x_i)$ are not directly available.
Note, however, that in some setups $K^t(0, x)$ may be known, while, in others,
it can be estimated, since it is also equal to the probability of acceptance,
that is, the probability of coalescence in state 0. Thus, we can implement the
Rao-Blackwellized improvement with estimated weights.


Scalable mining for classification rules in relational databases

Abstract: Data mining is a process of discovering useful patterns (knowledge)
hidden in extremely large datasets. Classification is a fundamental data mining
function, and some other functions can be reduced to it. In this paper we
propose a novel classification algorithm (classifier) called MIND (MINing in
Databases). MIND can be phrased in such a way that its implementation is
very easy using the extended relational calculus SQL, and this in turn allows
the classifier to be built into a relational database system directly. MIND is
truly scalable with respect to I/O efficiency, which is important since scalability
is a key requirement for any data mining algorithm.

We have built a prototype of MIND in the relational database management 
system DB2 and have benchmarked its performance. We describe the working 
prototype and report the measured performance with respect to the previous
method of choice. MIND scales not only with the size of datasets but also
with the number of processors on an IBM SP2 computer system. Even on 
uniprocessors, MIND scales well beyond dataset sizes previously published for 
classifiers. We also give some insights that may have an impact on the evolution 
of the extended relational calculus SQL. 


1. Introduction 

Information technology has developed rapidly over the last three decades. To make
decisions faster, many companies have combined data from various sources in
relational databases [16]. The data contain patterns previously undeciphered that are
valuable for business purposes. Data mining is the process of extracting valid, pre- 
viously unknown, and ultimately comprehensible information from large databases 
and using it to make crucial business decisions. The extracted information can be 
used to form a prediction or classification model, or to identify relations between 
database records. 

Since extracting data to files before running data mining functions would require
extra I/O costs, users of IM as well as previous investigations [20, 19] have pointed
to the need for the relational database management systems to have these functions
built in. Besides reducing I/O costs, this approach leverages over 20 years of research
and development in DBMS technology, including:

• scalability,

• memory hierarchy management [30, 33],

• parallelism [5],

• optimization of the executions [6],

• platform independence, and

• client server API [27].


salary   age   credit rating
65K      30    Safe
15K      23    Risky
75K      40    Safe
15K      28    Risky
100K     55    Safe
60K      45    Safe
62K      30    Risky

Table 1: Training set


The classification problem can be described informally as follows: We are given a 
training set (or DETAIL table) consisting of many training examples. Each training 
example is a row with multiple attributes, one of which is a class label. The objective 
of classification is to process the DETAIL table and produce a classifier, which 
contains a description (model) for each class. The models will be used to classify 
future data for which the class labels are unknown (see [4, 28, 26, 9]). 

Several classification models have been proposed in the literature, including
neural networks, decision trees, statistical models, and genetic models. Among these
models, the decision tree model is particularly suited for data mining applications
for the following reasons: (1) ease of construction, (2) simplicity and ease of
understanding, and (3) acceptable accuracy [29]. Therefore, we focus on the decision tree
model in this paper. A simple illustration of training data is shown in Table 1. The
examples reflect the past experience of an organization extending credit. From those
examples, we can generate the classifier shown in Figure 1.

Although memory and CPU prices are plunging, the volume of data available 
for analysis is immense and getting larger. We may not assume that the data are 
memory-resident. Hence, an important research problem is to develop accurate
classification algorithms that are scalable with respect to I/O and parallelism. Accuracy
is known to be domain-specific (e.g., insurance fraud, target marketing). However, 
the problem of scalability for large amounts of data is more amenable to a gen- 
eral solution. A classification algorithm should scale well; that is, the classification 
algorithm should work well even if the training set is huge and vastly overflows in- 
ternal memory. In data mining applications, it is common to have training sets with 
several million examples. It is observed in [24] that previously known classification 
algorithms do not scale. 

Figure 1: Decision tree for the data in Table 1


Random sampling is often an effective technique in dealing with large data
sets. For simple applications whose inherent structures are not very complex, this
approach is efficient and gives good results. However, in our case, we do not favor
random sampling for two main reasons:

1. In general, choosing the proper sample size is still an open question. The
following factors must be taken into account:

• The training set size. 

• The convergence of the algorithm. Usually, many iterations are needed 
to process the sampling data and refine the solution. It’s very difficult 
to estimate how fast the algorithm will give a satisfactory solution. 

• The complexity of the model. 

The best known theoretical upper bounds on sample size suggest that the 
training set size may need to be immense to assure good accuracy [13, 21]. 

2. In many real applications, customers insist that all data, not just a sample
of the data, must be processed. Since the data are usually obtained from
valuable resources at considerable expense, they should be used as a whole 
throughout the analysis. 

Therefore, designing a scalable classifier may be necessary or preferable, although 
we can always use random sampling in places where it is appropriate. 

In [24, 29, 18], data access for classification follows "a record at a time" access
paradigm. Scalability is addressed individually for each operating system, hardware
platform, and architecture. In this paper, we introduce the MIND (MINing in
Databases) classifier. MIND rephrases data classification as a classic database problem
of summarization and analysis thereof. MIND leverages the extended relational
calculus SQL, an industry standard, by reducing the solution to novel manipulations
of SQL statements embedded in a small program written in C.

MIND scales as long as the database primitives it uses scale. We can follow
the recommendations in [3, 22] that numerical data be discretized so that each
attribute has a reasonable number of distinct values. If so, operations like
histogram formation, which have a significant impact on performance, can be done in
a linear number of I/Os, usually requiring one, but never more than two, passes
over the DETAIL table [36]. Without the discretization, the I/O performance
bound has an extra factor that is logarithmic but fortunately with a very large
base M/B, which is the number of disk blocks that can fit in internal memory.

One advantage of our approach is that its implementation is easy. We have
implemented MIND as a stored procedure, a common feature in modern DBMSs.
In addition, since most modern database servers have very strong parallel query
processing capabilities, MIND runs in parallel at no extra cost. A salient feature of
MIND and one reason for its efficiency is its ability to do classification without any
update to the DETAIL table.

We analyze and compare the I/O complexities of MIND and the previous
method of choice, the interesting method called SPRINT [29]. Our theoretical
analysis and experimental results show that MIND scales well whereas SPRINT
can exhibit quadratic I/O times.

We describe our MIND algorithm in the next section; an illustrative example
is given in Section 4. A theoretical performance analysis is given in Section 5.
We revisit the MIND algorithm in Section 6 using a general extension of current SQL
standards. In Section 7, we present our experimental results. We make concluding
remarks in Section 8.


2. The algorithm 
2.1. Overview 

A decision tree classifier is built in two phases: a growth phase and a pruning phase.
In the growth phase, the tree is built by recursively partitioning the data until each
partition is either "pure" (all members belong to the same class) or sufficiently
small (according to a parameter set by the user). The form of the split used to
partition the data depends upon the type of the attribute used in the split. Splits
for a numerical attribute A are of the form value(A) ≤ x, where x is a value in
the domain of A. Splits for a categorical attribute A are of the form value(A) ∈ S,
where S is a subset of domain(A). We consider only binary splits as in [24, 29] for
purpose of comparisons. After the tree has been fully grown, it is pruned to remove
noise in order to obtain the final tree classifier.

The tree growth phase is computationally much more expensive than the
subsequent pruning phase. The tree growth phase accesses the training set (or DETAIL
table) multiple times, whereas the pruning phase only needs to access the fully
grown decision tree. We therefore focus on the tree growth phase. The following
pseudo-code gives an overview of our algorithm:

GrowTree(TrainingSet DETAIL)
    Initialize tree T and put all records of DETAIL in the root;
    while (some leaf in T is not a STOP node)
        for each attribute i do
            form the dimension table (or histogram) DIM_i;
            evaluate gini index for each non-STOP leaf at each split value
                with respect to attribute i;
        for each non-STOP leaf do
            get the overall best split for it;
        partition the records and grow the tree for one more level according to the
            best splits;
        mark all small or pure leaves as STOP nodes;
    return T;

2.2. Leaf node list data structure

A powerful method called SLIQ was proposed in [24] as a semi-scalable classification 
algorithm. The key data structure used in SLIQ is a class list whose size is linear 
in the number of examples in the training set. The fact that the class list must be 
memory-resident puts a hard limitation on the size of the training set that SLIQ
can handle.

In the improved SPRINT classification algorithm [29], new data structures
attribute list and histogram are proposed. Although it is not necessary for the
attribute list data structure to be memory-resident, the histogram data structure
must be in memory to ensure good performance. To perform the split in [29], a hash
table whose size is linear in the number of examples of the training set is used.
When the hash table is too large to fit in memory, splitting is done in multiple
steps, and SPRINT does not scale well.

In our MIND method, the information we need to evaluate the split and perform
the partition is stored in relations in a database. Thus we can take advantage of
DBMS functionalities and memory management. The only thing we need to do is
to incorporate a data structure that relates the database relations to the growing
classification tree. We assign a unique number to each node in the tree. When
loading the training data into the database, imagine the addition of a hypothetical
column leaf_num to each row. For each training example, leaf_num will always
indicate which leaf node in the current tree it belongs to. When the tree grows,
the leaf_num value changes to indicate that the record is moved to a new node by
applying a split. A static array called LNL (leaf node list) is used to relate the
leaf_num value in the relation to the corresponding node in the tree. By using a
labeling technique, we ensure that at each tree growing stage, the nodes always have
the identification numbers 0 through N − 1, where N is the number of nodes in the
tree. LNL[i] is a pointer to the node with identification number i. For any record
in the relation, we can get the leaf node it belongs to from its leaf_num value and
LNL and hence we can get the information in the node (e.g. split attribute and
value, number of examples belonging to this node and their class distribution).

To ensure the performance of our algorithm, LNL is the only data structure that
needs to be memory-resident. The size of LNL is equal to the number of nodes in
the tree, so LNL can always be stored in memory.
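The role of LNL can be pictured with a small in-memory sketch. Everything named here (the `Node` fields, `split_leaf`, the attribute name "age") is a hypothetical illustration rather than the authors' C implementation; the only property taken from the text is that node identifiers run from 0 to N − 1 and index directly into the list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: int
    split_attr: Optional[str] = None       # None while the node is a leaf
    split_value: Optional[float] = None
    class_counts: Dict[str, int] = field(default_factory=dict)
    left: Optional[int] = None             # identifiers of the children
    right: Optional[int] = None


def split_leaf(lnl: List[Node], leaf_num: int, attr: str, value: float):
    """Record a split on lnl[leaf_num] and append its two children.

    New identifiers are the next free indices, so they stay in the
    range 0..N-1 and LNL[i] is always the node numbered i.
    """
    parent = lnl[leaf_num]
    parent.split_attr, parent.split_value = attr, value
    for side in ("left", "right"):
        child = Node(node_id=len(lnl))
        lnl.append(child)
        setattr(parent, side, child.node_id)
    return parent.left, parent.right


# The root holds the seven training examples of Table 1.
lnl = [Node(0, class_counts={"Safe": 4, "Risky": 3})]
left_id, right_id = split_leaf(lnl, 0, "age", 30)
print(left_id, right_id, len(lnl))
```

Because the list grows with the number of tree nodes rather than with the number of training records, it stays small even when DETAIL is very large.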

2.3. Computing the gini index 

A splitting index is used to choose from alternative splits for each node. Several
splitting indices have recently been proposed. We use the gini index, originally
proposed in [4] and used in [24, 29], because it gives acceptable accuracy. The
accuracy of our classifier is therefore the same as those in [24, 29].

For a data set S containing N examples from C classes, gini(S) is defined as

\[ gini(S) = 1 - \sum_{i=1}^{C} p_i^2 \tag{1} \]

where $p_i$ is the relative frequency of class i in S. If a split divides S into two subsets
$S_1$ and $S_2$, with sizes $N_1$ and $N_2$ respectively, the gini index of the divided data
$gini_{split}(S)$ is given by

\[ gini_{split}(S) = \frac{N_1}{N}\, gini(S_1) + \frac{N_2}{N}\, gini(S_2) \tag{2} \]

The attribute containing the split point achieving the smallest gini index value
is then chosen to split the node [4]. Computing the gini index is the most expensive
part of the algorithm since finding the best split for a node requires evaluating the
gini index value for each attribute at each possible split point.
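As a worked illustration of equations (1) and (2), the sketch below evaluates the gini index from per-class counts. The function names are assumptions introduced for this example; the counts come from Table 1 (4 Safe and 3 Risky records), and the split shown, age ≤ 30, puts 1 Safe and 3 Risky records on one side and 3 Safe records on the other.

```python
def gini(class_counts):
    """Equation (1): 1 minus the sum of squared relative class frequencies."""
    n = sum(class_counts)
    if n == 0:
        return 0.0  # convention for an empty partition
    return 1.0 - sum((c / n) ** 2 for c in class_counts)


def gini_split(left_counts, right_counts):
    """Equation (2): size-weighted average of the gini index of both parts."""
    n1, n2 = sum(left_counts), sum(right_counts)
    return (n1 * gini(left_counts) + n2 * gini(right_counts)) / (n1 + n2)


root = gini([4, 3])                       # whole training set: 24/49
age_le_30 = gini_split([1, 3], [3, 0])    # counts are [Safe, Risky]: 3/14
print(root, age_le_30)
```

The split lowers the index from 24/49 (about 0.49) to 3/14 (about 0.21), which is why it is an attractive candidate at the root.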

The training examples are stored in a relational database system using a table
with the following schema: DETAIL(attr_1, attr_2, ..., attr_n, class, leaf_num), where
attr_i is the ith attribute, for 1 ≤ i ≤ n, class is the classifying attribute, and
leaf_num denotes which leaf in the classification tree the record belongs to. In
actuality leaf_num can be computed from the rest of the attributes in the record
and does not need to be stored explicitly. As the tree grows, the leaf_num value
of each record in the training set keeps changing. Because leaf_num is a computed
attribute, the DETAIL table is never updated, a key reason why MIND is efficient
for the DB2 relational database. We denote the cardinality of the class label set
by C, the number of the examples in the training set by N, and the number of
attributes (not including class label) by n.


3. Database implementation of MIND 

To emphasize how easily MIND is embeddable in a conventional database system 
using SQL and its accompanying optimizations, we describe our MIND components 
using SQL. 


3.1. Numerical attributes

For every level of the tree and for each attribute attr_i, we recreate the dimension
table (or histogram) called DIM_i with the schema DIM_i(leaf_num, class, attr_i,
count) using a simple SQL SELECT statement on DETAIL:

    INSERT INTO DIM_i
    SELECT leaf_num, class, attr_i, COUNT(*)
    FROM DETAIL
    WHERE leaf_num <> STOP
    GROUP BY leaf_num, class, attr_i

Although the number of distinct records in DETAIL can be huge, the maximum
number of rows in DIM_i is typically much less and is no greater than (#leaves in
tree) x (#distinct values on attr_i) x (#distinct classes), which is very likely to be
of the order of several hundreds [25]. By including leaf_num in the attribute list
for grouping, MIND collects summaries for every leaf in one query. In the case that
the number of distinct values of attr_i is very large, preprocessing is often done in
practice to further discretize it [3, 22]. Discretization of variable values into a smaller
number of classes is sometimes referred to as "encoding" in data mining practice [3].
Roughly speaking, this is done to obtain a measure of aggregate behavior that may
be detectable [25]. Alternatively, efficient external memory techniques can be used
to form the dimension tables in a small number (typically one or two) of linear passes,
at the possible cost of some added complexity in the application program to give
the proper hints to the DBMS, as suggested in Section 5.
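The grouping query can be tried on the Table 1 data with an in-memory SQLite database. This is only a sketch under stated assumptions: SQLite stands in for DB2, the column names `attr1`, `attr2` and `cnt`, the table name `DIM1`, the class coding (1 = Safe, 2 = Risky) and the sentinel value used for STOP are all choices made for this illustration.

```python
import sqlite3

STOP = -1  # hypothetical sentinel marking records in finished leaves

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DETAIL (attr1 INTEGER, attr2 INTEGER, class INTEGER, leaf_num INTEGER)"
)
# Table 1 with salary in thousands; every record starts in the root, leaf 0.
detail_rows = [
    (65, 30, 1, 0), (15, 23, 2, 0), (75, 40, 1, 0), (15, 28, 2, 0),
    (100, 55, 1, 0), (60, 45, 1, 0), (62, 30, 2, 0),
]
conn.executemany("INSERT INTO DETAIL VALUES (?, ?, ?, ?)", detail_rows)

# One GROUP BY builds the histogram for attribute 1 over all active leaves.
conn.execute(
    "CREATE TABLE DIM1 (leaf_num INTEGER, class INTEGER, attr1 INTEGER, cnt INTEGER)"
)
conn.execute(
    "INSERT INTO DIM1 "
    "SELECT leaf_num, class, attr1, COUNT(*) FROM DETAIL "
    "WHERE leaf_num <> ? GROUP BY leaf_num, class, attr1",
    (STOP,),
)

dim1 = conn.execute(
    "SELECT leaf_num, attr1, class, cnt FROM DIM1 ORDER BY attr1, class"
).fetchall()
for row in dim1:
    print(row)
```

The seven DETAIL rows collapse to six histogram rows because the two records with salary 15K share a class; on realistic data the reduction is far larger.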

After populating DIM_i, we evaluate the gini index value for each leaf node at
each possible split value of the attribute i by performing a series of SQL operations
that only involve accessing DIM_i.

It is apparent for each attribute i that its DIM_i table may be created in one pass
over the DETAIL table. It is straightforward to schedule one query per dimension
(attribute). Completion time is still linear in the number of dimensions. Commercial
DBMSs store data in row-major sequence. I/O efficiencies may be obtained if it is
possible to create dimension tables for all attributes in one pass over the DETAIL
table. Concurrent scheduling of the queries populating the DIM_i tables is the simple
approach. Existing buffer management schemes that rely on I/O latency appear
to synchronize access to DETAIL for the different attributes. The idea is that
one query piggy-backs onto another query's I/O data stream. Results from early
experiments are encouraging [31].

It is also possible for SQL to be extended to ensure that, in addition to optimizing
I/O, CPU processing is also optimized. Taking liberty with SQL standards, we write
the following query as a proposed SQL operator:

    SELECT FROM DETAIL
    INSERT INTO DIM_1 {leaf_num, class, attr_1, COUNT(*)
                       WHERE predicate
                       GROUP BY leaf_num, class, attr_1}
    INSERT INTO DIM_2 {leaf_num, class, attr_2, COUNT(*)
                       WHERE predicate
                       GROUP BY leaf_num, class, attr_2}
    ...
    INSERT INTO DIM_n {leaf_num, class, attr_n, COUNT(*)
                       WHERE predicate
                       GROUP BY leaf_num, class, attr_n}

The new operator forms multiple groupings concurrently and may allow further
RDBMS query optimization.

Since such an operator is not supported, we make use of the object extensions
in DB2, the user-defined function (udf) [32, 10, 17], which is another reason why
MIND is efficient. User-defined functions are used for association in [2]. The
user-defined function is a new feature provided by DB2 version 2 [10, 17]. In DB2 version 2,
the functions available for use in SQL statements extend from the system built-in
functions, such as avg, min, max, sum, to more general categories, such as
user-defined functions (udf). An external udf is a function that is written by a user in
a host programming language. The CREATE FUNCTION statement for an external
function tells the system where to find the code that implements the function. In
MIND we use a udf to accumulate the dimension tables for all attributes in one
pass over DETAIL.

For each leaf in the tree, possible split values for attribute i are all distinct values
of attr_i among the records that belong to this leaf. For each possible split value, we
need to get the class distribution for the two parts partitioned by this value in order
to compute the corresponding gini index. We collect such distribution information
in two relations, UP and DOWN.

Relation UP with the schema UP(leaf_num, attr_i, class, count) can be
generated by performing a self-outer-join on DIM_i:

    INSERT INTO UP
    SELECT d1.leaf_num, d1.attr_i, d1.class, SUM(d2.count)
    FROM (FULL OUTER JOIN DIM_i d1, DIM_i d2
          ON d1.leaf_num = d2.leaf_num AND
             d2.attr_i <= d1.attr_i AND
             d1.class = d2.class
          GROUP BY d1.leaf_num, d1.attr_i, d1.class)

Similarly, relation DOWN can be generated by just changing the <= to > in the
ON clause. We can also obtain DOWN by using the information in the leaf node
and the count column in UP without doing a join on DIM_i again.
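The content of UP and DOWN amounts to running totals over the sorted attribute values of each leaf. The following sketch computes both in plain Python rather than SQL; the function name and the dictionary layout are assumptions made for illustration. The input rows are those of the dimension table for the first attribute in the running example (class 1 = Safe, class 2 = Risky), and DOWN is derived from the per-leaf class totals, in the spirit of the shortcut mentioned above.

```python
from collections import defaultdict


def build_up_down(dim_rows):
    """dim_rows: iterable of (leaf_num, attr_value, class_label, count).

    Returns two dicts keyed by (leaf_num, attr_value, class_label):
    up   = number of records of that class with attribute <= attr_value,
    down = number of records of that class with attribute >  attr_value.
    """
    counts = defaultdict(int)
    values = defaultdict(set)
    classes = set()
    for leaf, value, label, count in dim_rows:
        counts[(leaf, value, label)] += count
        values[leaf].add(value)
        classes.add(label)

    up, down = {}, {}
    for leaf, vals in values.items():
        totals = {c: sum(counts[(leaf, v, c)] for v in vals) for c in classes}
        running = {c: 0 for c in classes}
        for v in sorted(vals):
            for c in classes:
                running[c] += counts[(leaf, v, c)]
                up[(leaf, v, c)] = running[c]
                down[(leaf, v, c)] = totals[c] - running[c]
    return up, down


dim1 = [
    (0, 15, 2, 2), (0, 60, 1, 1), (0, 62, 2, 1),
    (0, 65, 1, 1), (0, 75, 1, 1), (0, 100, 1, 1),
]
up, down = build_up_down(dim1)
for key in sorted(up):
    print(key, up[key], down[key])
```

Note that the result contains an entry for every (value, class) pair, including pairs with a zero count such as salary 15 and class 1.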

DOWN and UP contain all the information we need to compute the gini index
at each possible split value for each current leaf, but we need to rearrange them in
some way before the gini index is calculated. The following intermediate view can
be formed for all possible classes k:

    CREATE VIEW C_k_UP(leaf_num, attr_i, count) AS
    SELECT leaf_num, attr_i, count
    FROM UP
    WHERE class = k

Similarly, we define view C_k_DOWN from DOWN.

A view GINI_VALUE that contains all gini index values at each possible split
point can now be generated. Taking liberty with SQL syntax, we write

    CREATE VIEW GINI_VALUE(leaf_num, attr_i, gini) AS
    SELECT u_1.leaf_num, u_1.attr_i, f_gini
    FROM C_1_UP u_1, ..., C_C_UP u_C, C_1_DOWN d_1, ..., C_C_DOWN d_C
    WHERE u_1.attr_i = ... = u_C.attr_i = d_1.attr_i = ... = d_C.attr_i AND
          u_1.leaf_num = ... = u_C.leaf_num = d_1.leaf_num = ... = d_C.leaf_num

where f_gini is a function of u_1.count, ..., u_C.count, d_1.count, ..., d_C.count
according to (1) and (2).

We then create a table MIN_GINI with the schema MIN_GINI(leaf_num,
attr_name, attr_value, gini):

    INSERT INTO MIN_GINI
    SELECT leaf_num, :i, attr_i, gini
    FROM GINI_VALUE a
    WHERE a.gini = (SELECT MIN(gini)
                    FROM GINI_VALUE b
                    WHERE a.leaf_num = b.leaf_num)

Table MIN_GINI now contains the best split value and the corresponding gini
index value for each leaf node of the tree with respect to attr_i. The table formation
query has a nested subquery in it. The performance and optimization of such queries
are studied in [6, 26, 15].

We repeat the above procedure for all other attributes. At the end, the best
split value for each leaf node with respect to all attributes will be collected in table
MIN_GINI, and the overall best split for each leaf is obtained from executing the
following:

    CREATE VIEW BEST_SPLIT(leaf_num, attr_name, attr_value) AS
    SELECT leaf_num, attr_name, attr_value
    FROM MIN_GINI a
    WHERE a.gini = (SELECT MIN(gini)
                    FROM MIN_GINI b
                    WHERE a.leaf_num = b.leaf_num)

3.2. Categorical attributes 

For categorical attribute i, we form DIM_i in the same way as for numerical
attributes. DIM_i contains all the information we need to compute the gini index for
any subset splitting. In fact, it is an analog of the count matrix in [29], but formed
with set-oriented operators.

A possible split is any subset of the set that contains all the distinct attribute
values. If the cardinality of attribute i is m, we need to evaluate the splits for all the
2^m subsets. Those subsets and their related counts can be generated in a recursive
way. The schema of the relation that contains all the k-sets is S_k_IN(leaf_num,
class, v_1, v_2, ..., v_k, count). Obviously we have DIM_i = S_1_IN. S_k_IN is then
generated from S_1_IN and S_{k-1}_IN as follows:

    INSERT INTO S_k_IN
    SELECT p.leaf_num, p.class, p.v_1, ..., p.v_{k-1}, q.v_1, p.count + q.count
    FROM (FULL OUTER JOIN S_{k-1}_IN p, S_1_IN q
          ON p.leaf_num = q.leaf_num AND
             p.class = q.class AND
             q.v_1 > p.v_{k-1})

We generate relation S_k_OUT from S_k_IN in a manner similar to how we
generate DOWN from UP. Then we treat S_k_IN and S_k_OUT exactly as DOWN and
UP for numerical attributes in order to compute the gini index for each k-set split.

A simple observation is that we don't need to evaluate all the subsets. We only
need to compute the k-sets for k = 1, 2, ..., ⌊m/2⌋ and thus save time. For large m,
greedy heuristics are often used to restrict search.
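The saving from stopping at k = ⌊m/2⌋ can be seen by enumerating the candidate subsets directly. The generator below is a hypothetical illustration (its name and the example values are not from the paper); it relies on the fact that a subset and its complement describe the same binary split.

```python
from itertools import combinations


def candidate_subsets(values):
    """Yield the subsets of size 1..floor(m/2) of the distinct values.

    Larger subsets are skipped because their complements, which define
    the same binary split, have already been produced.
    """
    distinct = sorted(set(values))
    m = len(distinct)
    for k in range(1, m // 2 + 1):
        for subset in combinations(distinct, k):
            yield frozenset(subset)


five = list(candidate_subsets(["a", "b", "c", "d", "e"]))
four = list(candidate_subsets(["a", "b", "c", "d"]))
print(len(five), len(four))
```

For m = 5 this yields 15 subsets, one per distinct split. For even m the subsets of size exactly m/2 still come in complementary pairs, so a few splits are visited twice (10 subsets for the 7 distinct splits when m = 4); deduplicating them is a further refinement this sketch does not attempt.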

3.3. Partitioning

Once the best split attribute and value have been found for a leaf, the leaf is split
into two children. If leaf_num is stored explicitly as an attribute in DETAIL, then
the following UPDATE performs the split for each leaf:

    UPDATE DETAIL
    SET leaf_num = Partition(attr_1, ..., attr_n, class, leaf_num)

The user-defined function Partition is defined on a record r of DETAIL as follows:

    Partition(record r)
        Use the leaf_num value of r to locate the tree node n that r belongs to;
        Get the best split from node n;
        Apply the split to r, grow a new child of n if necessary;
        Return a new leaf_num according to the result of the split;
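The lookup performed by Partition can be sketched as follows. This is an illustrative simplification, not the DB2 user-defined function: nodes are plain dictionaries in a list indexed by node number, children are assumed to exist already rather than being grown on demand, numerical splits are assumed to be of the form attribute ≤ value, and all field names are hypothetical.

```python
def partition(record, leaf_num, lnl):
    """Return the number of the leaf that `record` falls into.

    Starting from node `leaf_num`, follow recorded splits until a node
    without a split (a current leaf) is reached. Calling it with
    leaf_num = 0 applies the whole current tree to the record, which is
    how leaf_num can be computed instead of being stored.
    """
    node = lnl[leaf_num]
    while node["split_attr"] is not None:
        goes_left = record[node["split_attr"]] <= node["split_value"]
        leaf_num = node["left"] if goes_left else node["right"]
        node = lnl[leaf_num]
    return leaf_num


# A one-split tree: the root tests age <= 30; nodes 1 and 2 are leaves.
leaf = {"split_attr": None, "split_value": None, "left": None, "right": None}
lnl = [
    {"split_attr": "age", "split_value": 30, "left": 1, "right": 2},
    dict(leaf),
    dict(leaf),
]

young = partition({"salary": 15, "age": 23}, 0, lnl)
older = partition({"salary": 75, "age": 40}, 0, lnl)
print(young, older)
```

Since the function reads only the record and the in-memory node list, it can be evaluated inside the grouping query without writing anything back to DETAIL.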




Scalable mining for classification rules in relational databases 




attr_1   attr_2   class   leaf_num
65K      30       Safe    0
15K      23       Risky   0
75K      40       Safe    0
15K      28       Risky   0
100K     55       Safe    0
60K      45       Safe    0
62K      30       Risky   0

Table 2: Initial relation DETAIL with implicit leaf_num

Figure 2: Initial tree


leaf_num   attr_1   class   count
0          15       2       2
0          60       1       1
0          62       2       1
0          65       1       1
0          75       1       1
0          100      1       1

Table 3: Relation DIM_1


However, leaf_num is not a stored attribute in DETAIL because updating the
whole relation DETAIL is expensive. We observe that Partition is merely applying
the current tree to the original training set. We avoid the update by replacing
leaf_num by function Partition in the statement forming DIM_i. If DETAIL is
stored on non-updatable tapes, this solution is required. It is important to note
that once the dimension tables are created, the gini index computation for all
leaves involves only dimension tables.

4. An example 

We illustrate our algorithm by an example. The example training set is the same
as the data in Table 1.

Phase 0: Load the training set and initialize the tree and LNL. At this stage,
relation DETAIL, the tree, and LNL are shown in Table 2 and Figure 2.

Phase 1: Form the dimension tables for all attributes in one pass over DETAIL
using the user-defined function. The resulting dimension tables are shown in Tables 3 and 4.

Phase 2: Find the best splits for current leaf nodes. A best split is found through 

a set of operat ions on relations as described in Section 2. 

First we evaluate the gini index value for attr_1. The procedure is depicted in
Tables 5-13.

leaf_num    attr_2    class    count
0           23        2        1
0           28        2        1
0           30        1        1
0           30        2        1
0           40        1        1
0           45        1        1
0           55        1        1

Table 4: Relation DIM_2


leaf_num    attr_1    class    count
0           15        1        0
0           15        2        2
0           60        1        1
0           60        2        2
0           62        1        1
0           62        2        3
0           65        1        2
0           65        2        3
0           75        1        3
0           75        2        3
0           100       1        4
0           100       2        3

Table 5: Relation UP


We can see that the best splits on the two attributes achieve the same gini index
value, so relation BEST_SPLIT is the same as MIN_GINI except that it does not
contain the column gini. We store the best split in each leaf node of the tree (the
root node in this phase). In case of a tie for best split at a node, any one of them
(attr_2 in our example) can be chosen.
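The numbers in this phase can be checked with a short Python sketch. It assumes the usual weighted gini index of a binary split, gini = 1 - sum of squared class fractions on each side, weighted by the share of records on that side, which is taken here to match (1) and (2); the helper names are hypothetical.

```python
def gini(counts):
    """Gini index of a class-count vector: 1 - sum(p_c^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0


def best_split(values, classes):
    """Return (threshold, weighted gini) for the best test `value <= threshold`."""
    labels = sorted(set(classes))
    n = len(values)
    best = None
    for v in sorted(set(values))[:-1]:          # the largest value cannot split
        pairs = list(zip(values, classes))
        up = [sum(1 for x, c in pairs if x <= v and c == k) for k in labels]
        down = [sum(1 for x, c in pairs if x > v and c == k) for k in labels]
        g = sum(up) / n * gini(up) + sum(down) / n * gini(down)
        if best is None or g < best[1]:
            best = (v, g)
    return best


attr_1 = [65, 15, 75, 15, 100, 60, 62]
attr_2 = [30, 23, 40, 28, 55, 45, 30]
classes = [1, 2, 1, 2, 1, 1, 2]          # 1 = Safe, 2 = Risky
split_1 = best_split(attr_1, classes)    # best threshold on attr_1
split_2 = best_split(attr_2, classes)    # best threshold on attr_2
```

Both attributes reach the same minimum, 3/14 (about 0.2143), at attr_1 = 62 and attr_2 = 30, which agrees with Tables 11-13 up to rounding.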

Phase 3: Partitioning. According to the best split found in Phase 2, we grow the
tree and partition the training set. The partition is reflected as leaf_num updates
in relation DETAIL. Any newly grown node that is pure or "small enough" is marked
and reassigned a special leaf_num value STOP so that it is not processed further.
The tree is shown in Figure 3 and the new DETAIL is shown in Table 14. Again,
note that leaf_num is never stored in DETAIL, so no update to DETAIL is necessary.

Phase 4: Repeat Phase 1 through Phase 3 until all the leaves in the tree become
STOP leaves. The final tree and DETAIL are shown in Figure 4 and Table 15.

5. Performance analysis 

Building classifiers for large training sets is an I/O bound application. In this section 
we analyze the I/O complexity of both MIND and SPRINT and compare their 
performances. 

As we described in Section 2.1, the classification algorithm iteratively does two
main operations: computing the splitting index (in our case, the gini index) and
performing the partition. SPRINT [29] forms an attribute list (projection of the
DETAIL table) for each attribute. In order to reduce the cost of computing the
gini index, SPRINT presorts each attribute list and maintains the sorted order
throughout the course of the algorithm. However, the use of attribute lists
complicates the partitioning operation. When updating the leaf information for the
entries in an attribute list corresponding to some attribute that is not the
splitting attribute, there is no local information available to determine how the entries
should be partitioned. A hash table (whose size is linear in the number of training
examples that reach the node) is repeatedly queried by random access to determine
how the entries should be partitioned. In large data mining applications, the hash
table is therefore not memory-resident, and several extra I/O passes may be needed,
resulting in highly nonlinear performance.

leaf_num    attr_1    class    count
0           15        1        4
0           15        2        1
0           60        1        3
0           60        2        1
0           62        1        3
0           62        2        0
0           65        1        2
0           65        2        0
0           75        1        1
0           75        2        0

Table 6: Relation DOWN

leaf_num    attr_1    count
0           15        0.0
0           60        1.0
0           62        1.0
0           65        2.0
0           75        3.0
0           100       4.0

Table 7: Relation C1_UP

leaf_num    attr_1    count
0           15        2.0
0           60        2.0
0           62        3.0
0           65        3.0
0           75        3.0
0           100       3.0

Table 8: Relation C2_UP

MIND avoids the external memory thrashing during the partitioning phase
by the use of dimension tables DIM_i that are formed while the DETAIL table,
consisting of all the training examples, is streamed through memory. In practice,
the dimension tables will likely fit in memory, as they are much smaller than the
DETAIL table, and often preprocessing is done by discretizing the examples to
make the number of distinct attribute values small. While vertical partitioning of
DETAIL may also be used to compute the dimension tables in linear time, we
show that it is not a must. Data in and data archived from commercial
databases are mostly in row major order. The layout does not appear to hinder
performance.

leaf_num    attr_1    count
0           15        4.0
0           60        3.0
0           62        3.0
0           65        2.0
0           75        1.0

Table 9: Relation C1_DOWN

leaf_num    attr_1    count
0           15        1.0
0           60        1.0
0           62        0.0
0           65        0.0
0           75        0.0

Table 10: Relation C2_DOWN

leaf_num    attr_1    gini
0           15        0.22856
0           60        0.40474
0           62        0.21428
0           65        0.34284
0           75        0.42856

Table 11: Relation GINI_VALUE

If the dimension tables cannot fit in memory, they can be formed by sorting in
linear time, if we make the weak assumption that (M/B)^c ≥ D/B for some small
positive constant c, where D, M, and B are respectively the dimension table size, the
internal memory size, and the block size [7, 36]. This optimization can be obtained
automatically if SQL has the multiple grouping operator proposed in Section 3.1
and with appropriate query optimization, or by appropriate restructuring of the
SQL operations. The dimension tables themselves are used in a stream fashion
when forming the UP and DOWN relations. The running time of the algorithm
thus scales linearly in practice with the training set size.

Now let's turn to the detailed analysis of the I/O complexity of both algorithms.
We will use the parameters in Table 16 (all sizes are measured in bytes) in our
analysis.

Each record in DETAIL has n attribute values of size r_a, plus a class label that
we assume takes one byte. Thus we have r = n r_a + 1. For simplicity we regard r_a
as some unit size and thus r = O(n). Each entry in a dimension table consists of
one node number, one attribute value, one class label and one count. The largest
node number is 2^L, and it can therefore be stored in L bits, which for simplicity
we assume can fit in one word of memory. (Typically L is on the order of 10-20. If
desired, we can rid ourselves of this assumption on L by rearranging DETAIL or a
copy of DETAIL so that no leaf_num field is needed in the dimension tables, but
in practice this is not needed.) The largest count is N, so r_d = O(log N). Counts
are used to record multiple instances of a common value in a compressed way, so
they always take less space than the original records they represent. We thus have

    $D_k \le \min\{nN,\; V C\, 2^k r_d\}$.    (3)

In practice, the second expression in the min term is typically the smaller one, but
in our worst-case expressions below we will often bound D_k by nN.

leaf_num    attr_name    attr_value    gini
0           1            62            0.21428

Table 12: Relation MIN_GINI after attr_1 is evaluated

leaf_num    attr_name    attr_value    gini
0           1            62            0.21428
0           2            30            0.21428

Table 13: Relation MIN_GINI after attr_1 and attr_2 are evaluated

Figure 3: Decision tree at Phase 3
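To get a feeling for bound (3), the following Python sketch evaluates it for made-up parameter values. The numbers are hypothetical and chosen only for illustration; sizes are expressed in units of r_a, as in the text.

```python
def dimension_table_bound(n, N, V, C, k, r_d):
    """Upper bound (3) on D_k, the total dimension-table size at depth k."""
    return min(n * N, V * C * 2**k * r_d)


# Hypothetical workload: 10 attributes, 10 million rows, 1000 distinct
# attribute values overall, 2 classes, 16 units per dimension-table entry.
bound_at_depth_5 = dimension_table_bound(n=10, N=10**7, V=1000, C=2, k=5, r_d=16)
```

For these values the second expression in the min term is about one hundred times smaller than nN, in line with the remark that it is typically the smaller one in practice.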

Claim 1. If all dimension tables fit in memory, that is, D_k ≤ M for all k, the I/O
complexity of MIND is

    $O\left(\frac{LnN}{B}\right)$,    (4)

which is essentially best possible.

Proof. If all dimension tables fit in memory, then we only need to read DETAIL
once at each level. Dimension tables for all attributes are accumulated in memory
as each DETAIL record is read in. When the end of the DETAIL table is reached,
we will have all the unsorted dimension tables in memory. Then sorting and the gini
index computation are performed for each dimension table, and the best split is
found for each current leaf node.

The I/O cost to read in DETAIL once is rN/B = O(nN/B), and there are L
levels in the final classifier, so the total I/O cost is O(LnN/B). □

Claim 2. In the case when not all dimension tables fit in memory at the same
time, but each individual dimension table does, the I/O complexity of MIND is

    $O\left(\frac{LnN}{B}\log_{M/B} n\right)$.    (5)
attr_1    attr_2    class    leaf_num
65K       30        Safe     1
15K       23        Risky    1
75K       40        Safe     2 => STOP
15K       28        Risky    1
100K      55        Safe     2 => STOP
60K       45        Safe     2 => STOP
62K       30        Risky    1

Table 14: Relation DETAIL with implicit leaf_num after Phase 3

Figure 4: Final decision tree

attr_1    attr_2    class    leaf_num
65K       30        Safe     4 => STOP
15K       23        Risky    3 => STOP
75K       40        Safe     STOP
15K       28        Risky    3 => STOP
100K      55        Safe     STOP
60K       45        Safe     STOP
62K       30        Risky    3 => STOP

Table 15: Final relation DETAIL with implicit leaf_num


Proof. In the case when not all dimension tables fit in memory at the same time, but
each individual dimension table does, we can form, use and discard each dimension
table on the fly. This can be done by a single pass through the DETAIL table when
M/n > B (which is always true in practice).

MIND keeps a buffer of size O(M/n) for each dimension. In scanning DETAIL,
for each dimension, its buffer is used to store the accumulated information.
Whenever a buffer is full, it is written to disk. When the scanning of DETAIL is
finished, many blocks have been obtained for each dimension, based on which the
final dimension table can be formed easily. For example, there might be two entries
(1, 1, 1, count_1) and (1, 1, 1, count_2) in two blocks for attr_1. They correspond to
an entry with leaf_num = 1, class = 1, attr_1 = 1 in the final dimension table
for attr_1 and will become an entry (1, 1, 1, count_1 + count_2) in the final dimension
table. All the blocks that correspond to one dimension are collectively called an
intermediate dimension table for that dimension.

M      size of internal memory
B      size of disk block
N      # of rows in DETAIL
n      # of attributes in DETAIL (not including class label)
C      # of distinct class labels
L      depth of the final classifier
D_k    total size of all dimension tables at depth k
V      # of distinct values for all attributes
r      size of each record in DETAIL
r_a    size of each attribute value in DETAIL (for simplicity, we assume that all attribute values are of similar size)
r_d    size of each record in a dimension table
r_h    size of each record in a hash table (used in SPRINT)

Table 16: Parameters used in analysis
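The buffering scheme used in the proof of Claim 2 can be sketched in Python as follows. This is a simplified in-memory model, not the authors' implementation: "writing to disk" is modelled by appending to a list, and the names `scan_with_buffer` and `summarize` are hypothetical.

```python
from collections import Counter


def scan_with_buffer(entries, capacity):
    """Accumulate (leaf_num, class, attr_val) counts; flush when the buffer is full."""
    blocks, buffer = [], Counter()
    for key in entries:
        if key not in buffer and len(buffer) == capacity:
            blocks.append(dict(buffer))      # "write the buffer to disk"
            buffer.clear()
        buffer[key] += 1
    if buffer:
        blocks.append(dict(buffer))          # flush what is left at the end
    return blocks


def summarize(blocks):
    """Merge the flushed blocks into the final, sorted dimension table."""
    final = Counter()
    for block in blocks:
        final.update(block)                  # counts of equal keys are added
    return sorted((*key, cnt) for key, cnt in final.items())


# The key (1, 1, 1) ends up in two different blocks and is merged at the end.
entries = [(1, 1, 1), (1, 1, 1), (1, 2, 5), (1, 1, 1)]
blocks = scan_with_buffer(entries, capacity=1)
final_table = summarize(blocks)
```

The list `blocks` plays the role of the intermediate dimension table, and `final_table` that of the final dimension table.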


Now the intermediate dimension table for the first attribute is read into memory,
summarized, and sorted into a final dimension table. Then MIND calculates the gini
index values with respect to this dimension for each leaf node, and keeps the current
minimum gini index value and the corresponding (attribute_name, attribute_value)
pair in each leaf node. When the calculation for the first attribute is done, the
in-memory dimension table is discarded. MIND repeats the same procedure for the
second attribute, and so on. Finally, we get the best splits for all leaf nodes and
we are ready to grow the tree one more level. The I/O cost at level k is scanning
DETAIL once, plus writing out and reading in all the intermediate dimension tables
once. We denote the total size of all intermediate dimension tables at level k by
D'_k. Note that the intermediate dimension tables are a compressed version of the
original DETAIL table, and they take much less space than the original records
they represent. So we have

    $D'_k \le nN$.

Summing over all levels, the I/O cost is

    $O\left(\frac{LnN}{B} + \sum_{0 \le k < L}\frac{D'_k}{B}\right) = O\left(\frac{LnN}{B}\right)$.

In the very unlikely scenario where M/n < B, a total of log_{M/B} n passes over
DETAIL are needed, resulting in a total I/O complexity in (5). □

Now let's consider the worst case in which some individual dimension tables do
not fit in memory. We employ a merge sort process. An interesting point is that the
merge sort process here is different from the traditional one: after several passes in
the merge sort, the lengths of the runs will not increase anymore; they are upper
bounded by the number of rows in the final dimension tables, whose size, although
too large to fit in memory, is typically small in comparison with N.

We formally define the special sort problem. We adopt the notations used
in [35]:

    N = problem size (in units of data items),
    M = internal memory size (in units of data items),
    B = block size (in units of data items),
    m = number of blocks that fit into internal memory,

where 1 < B < M < N.

The special sort problem can be defined as follows:

Definition 1. There are N' (N' ≪ N) distinct keys, {k_1, k_2, ..., k_{N'}}, and we
assume k_1 < k_2 < ... < k_{N'} for simplicity. We have N data items (k_{x(i)}, count_i),
for 1 ≤ i ≤ N, 1 ≤ x(i) ≤ N'.

The goal is to obtain N' data items with the keys in sorted increasing order and
the corresponding counts summarized; that is, (k_i, COUNT_i), where

    $COUNT_i = \sum_{1 \le k \le N,\; x(k) = i} count_k$

for 1 ≤ i ≤ N'.

Lemma 1. The I/O complexity of the special sort problem is

    $O\left(\frac{N}{B}\log_m\frac{N'}{B} + \frac{N}{B}\right)$.    (6)

Proof. We perform a modified merge sort procedure for the special sort problem.
First N/M sorted "runs" are formed by repeatedly filling up the internal
memory, sorting the records according to their key values, combining the records with
the same key and summarizing their counts, and writing the results to disk. This
requires O(N/B) I/Os. Next m runs are continually merged and combined together
into a longer sorted run, until we end up with one sorted run containing all the N'
records.

In a traditional merge sort procedure, the crucial property is that we can merge
m runs together in a linear number of I/Os. To do so we simply load a block from
each of the runs and collect and output the B smallest elements. We continue this
process until we have processed all elements in all runs, loading a new block from
a run every time a block becomes empty. Since there are O(log_m (N/B)) levels in
the merge process, and each level requires O(N/B) I/O operations, we obtain the
O((N/B) log_m (N/B)) complexity for the normal sort problem.

An important difference between the special sort procedure and the traditional
one is that in the former, the length of each sorted run will not go beyond N' while
in the latter, the length of sorted runs at each level keeps increasing (doubling)
until reaching N.

In the special sort procedure, at and after level k = ⌈log_{M/B} (N'/B)⌉, the length
of any run will be bounded by N' and the number of runs is bounded by N/N'.
(For simplicity, we will ignore all the floors and ceilings in the following discussion.)
From level k + 1 on, the operation we perform at each level is basically combining
each m runs (each with a length less than or equal to N') into one run whose length
is still bounded by N'. We repeat this operation at each level until we get a single
run. At level k + i, we combine N/(N' m^{i-1}) runs into N/(N' m^i) runs and the I/O
at this level is

    $2\,\frac{N}{B}\cdot\frac{1}{m^{i-1}}$.

We will finish the combining procedure at level k + p, where p = log_m (N/N') and
n' = N'/B. So the I/O for the whole special sort procedure is:

    $2\frac{N}{B}\log_m n' + 2\frac{N}{B}\left(1 + \frac{1}{m} + \cdots + \frac{1}{m^{p-1}}\right)
      = 2\frac{N}{B}\log_m n' + 2\frac{N}{B}\cdot\frac{1 - (1/m)^p}{1 - 1/m}
      \approx 2\frac{N}{B}\log_m n' + 2\frac{N}{B}
      = O\left(\frac{N}{B}\log_m n' + \frac{N}{B}\right)$.
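The modified merge sort can be sketched in Python as follows. This is an in-memory model that only illustrates the combining of equal keys at every level; it does not model blocks or actual disk I/O, the function names are hypothetical, and `fan_in` (the m of the text) is assumed to be at least 2.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def combine(sorted_items):
    """Collapse adjacent items with equal keys, summing their counts."""
    return [(key, sum(cnt for _, cnt in group))
            for key, group in groupby(sorted_items, key=itemgetter(0))]


def special_sort(items, memory_size, fan_in):
    """Sort (key, count) items and sum the counts per key, merging fan_in runs at a time."""
    # Run formation: fill memory, sort, combine equal keys.
    runs = [combine(sorted(items[i:i + memory_size]))
            for i in range(0, len(items), memory_size)]
    # Merge passes: no run can grow beyond the number of distinct keys N'.
    while len(runs) > 1:
        runs = [combine(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []


items = [("b", 1), ("a", 2), ("b", 3), ("c", 1), ("a", 1), ("b", 1)]
result = special_sort(items, memory_size=2, fan_in=2)
```

Here six input items with three distinct keys collapse to three output items, so no run ever holds more than N' = 3 entries.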

□

Now we are ready to give the I/O complexity of MIND in the worst case.

Theorem 1. In the worst case the I/O complexity of MIND is

    $O\left(\frac{LnN}{B} + \sum_{0 \le k < L}\frac{D'_k}{B}\log_{M/B}\frac{D_k}{B}\right)$,    (7)

which is

    $O\left(\frac{LnN}{B}\log_{M/B}\frac{nN}{B}\right)$.    (8)

In most applications, the log term is negligible, and the I/O complexity of MIND
becomes

    $O\left(\frac{LnN}{B}\right)$,    (9)

which matches the optimal time of (4).

Proof. This is similar to the proof of Claim 2. At level k of the tree growth phase,
MIND first forms all the intermediate dimension tables with total size D'_k in
external memory. This can be done by a single pass through the DETAIL table,
as follows. MIND keeps a buffer of size O(M/n) for each dimension. In scanning
DETAIL, MIND accumulates information for each dimension in its corresponding
buffer; whenever a buffer is full, it is written to disk. When the scanning of
DETAIL is finished, MIND performs the special merge sort procedure for the disk
blocks corresponding to all (not individual) dimension tables. At the last level of
the special sort, the final dimension table for each attribute will be formed one by
one. MIND calculates the gini index values with respect to each dimension for each
leaf node, and keeps the current minimum gini index value and the corresponding
(attribute_name, attribute_value) pair in each leaf node. When the calculation for
the last attribute is done, we get the best splits for all leaf nodes and we are ready
to grow the tree one more level.

The I/O cost at level k is scanning DETAIL once, which is O(nN/B), plus the
cost of writing out all the intermediate dimension tables once, which is bounded by
O(nN/B), plus the cost for the special sort, which is O((D'_k/B) log_{M/B} (D_k/B)).
So the I/O for all levels is

    $O\left(\frac{LnN}{B} + \sum_{0 \le k < L}\frac{D'_k}{B}\log_{M/B}\frac{D_k}{B}\right)$,

which is

    $O\left(\frac{LnN}{B}\log_{M/B}\frac{nN}{B}\right)$. □


Now we analyze the I/O complexity of the SPRINT algorithm. There are two
major parts in SPRINT: the pre-sorting of all attribute lists and the
constructing/searching of the corresponding hash tables during partition. Since we are
dealing with a very large DETAIL table, it is unrealistic to assume that N is small
enough to allow hash tables to be stored in memory. Actually those hash tables
need to be stored on disk and brought into memory during the partition phase. It
is true that hash tables will become smaller at deeper levels and thus fit in memory,
but at the early levels they are very large; for example, the hash table at level 0
has N entries.

Each entry in a hash table contains a tid (transaction identifier), which is an
integer in the range of 1 to N, and one bit that indicates which child this record
should be partitioned to in the next level of the classifier. So we have

    $r_h = \frac{1 + \log N}{8}$.

We can estimate when the hash tables will fit in memory, given the optimistic
assumptions that all memory is allocated to hash tables and all hash tables at each
node have equal size; that is, a hash table at level k contains N/2^k entries. Thus, a
hash table at level k fits in memory if r_h N/2^k ≤ M, or

    $k \ge \log\left(\frac{N}{M}\cdot\frac{1 + \log N}{8}\right)$.    (10)

For sufficiently large k, (10) will be satisfied, that is, hash tables become smaller at
deeper nodes and thus fit in memory. But it is clear that even for moderately large
detail tables, hash tables at upper levels will not fit in memory.

During the partition phase, each non-splitting attribute list at each node needs
to be partitioned into two parts based on the corresponding hash table. One way
to do this is to do a random hash table search for each entry in the list, but this is
very expensive. Fortunately, there is a better way: first, we bring a large portion
of the hash table into memory. The size of this portion is limited only by the
availability of the internal memory. Then we scan the non-splitting list once, block
by block, and for each entry in the list, we search the in-memory portion of the
hash table. In this way, the hash table is swapped into memory only once, and each
non-splitting attribute list is scanned N/M times. For even larger N, it is better
to do the lookup by batch sorting, but that approach is completely counter to the
founding philosophy of the SPRINT algorithm.

A careful analysis gives us the following estimation:

Theorem 2. The I/O complexity of SPRINT is

    $O\left(\frac{nN}{B}\log_{M/B}\frac{N}{B} + \frac{nN^2(1 + \log N)}{BM} + \frac{LnN}{B}\right)$.    (11)

Proof. To perform the pre-sort of the SPRINT algorithm, we need to read DETAIL
once, write out the unsorted attribute lists, and sort all the attribute lists. So we
have

    $IO_{presort} = O\left(\frac{nN}{B}\log_{M/B}\frac{N}{B}\right)$.

From level 0 through level k - 1, hash tables will not fit in memory. At level i
(0 ≤ i ≤ k - 1), SPRINT will perform the following operations:

1. Scan the attribute lists one by one to find the best split for each leaf node.

2. According to the best split found for each leaf node, form the hash tables and
write them to disk.

3. Partition the attribute list of the splitting attribute for each leaf node.

4. Partition the attribute lists for the n - 1 non-splitting attributes for each leaf
node.

Among these operations, the last one incurs the most I/O cost and we perform it
by bringing a portion of a hash table into memory first. The size of this portion
is limited only by the availability of the main memory. Then we scan each
non-splitting list once, block by block, and for each entry in the list, we search the
in-memory portion of the hash table and decide which child this entry should go to in
the next level. In this way, the hash table is swapped into memory only once, and
the non-splitting list is scanned multiple times. The I/O cost of this operation is

    $O\left(\frac{nN h_i}{B}\right)$,

where h_i is the number of portions we need to partition a hash table into due to
the limitation of the memory size.

From level k to level L the hash table will fit in memory, and the I/O cost for
those levels is O((L - k)nN/B), which is significantly smaller than that for the
previous levels.

So the I/O cost of SPRINT becomes

    $O\left(\frac{nN}{B}\log_{M/B}\frac{N}{B} + \sum_{0 \le i \le k-1}\frac{nN h_i}{B} + \frac{(L - k)nN}{B}\right)$.    (12)

Note that we have

    $h_i = \frac{r_h N}{2^i M} = \frac{N}{2^i M}\left(\frac{1 + \log N}{8}\right)$.

So

    $\frac{N}{M}\left(\frac{1 + \log N}{8}\right) \le \sum_{0 \le i \le k-1} h_i \le \frac{2N}{M}\left(\frac{1 + \log N}{8}\right)$.    (13)

Applying (13) to (12), we get the I/O complexity of SPRINT in (11).

□
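The quantities used in this analysis are easy to evaluate numerically. The Python sketch below assumes, as the description of a hash table entry suggests, that an entry takes (1 + log2 N) bits, that is (1 + log2 N)/8 bytes; the function names and the sample sizes are hypothetical.

```python
import math


def hash_entry_bytes(N):
    """r_h: a tid of log2(N) bits plus one child bit, expressed in bytes."""
    return (1 + math.log2(N)) / 8


def portions(N, M, level):
    """h_i: number of memory-sized portions of a hash table with N / 2^level entries."""
    return hash_entry_bytes(N) * N / (2**level * M)


def first_in_memory_level(N, M):
    """Smallest level k with r_h * N / 2^k <= M, i.e. condition (10)."""
    return max(0, math.ceil(math.log2(hash_entry_bytes(N) * N / M)))


# Hypothetical sizes: 2^30 training records and 2^26 bytes (64 MB) of memory.
N, M = 2**30, 2**26
h_0 = portions(N, M, level=0)          # portions needed at the root
k = first_in_memory_level(N, M)        # first level whose hash tables fit
```

For these sizes the root hash table must be processed in 62 portions, and hash tables only fit in memory from level 6 on.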


Examination of (8) and (11) reveals that MIND is clearly better in terms of
I/O performance. For large N, SPRINT does a quadratic number of I/Os, whereas
MIND scales well.




6. Algorithm revisited using SchemaSQL

In Section 3.1, we described the MIND algorithm using SQL-like statements. Due to
the limitations of current SQL standards, most of those SQL-like statements are not
supported directly in today's DBMS products. Therefore, we need to convert them
to currently supported SQL statements, augmented with new facilities like
user-defined functions. Putting logic within a user-defined function hides the operator
from query optimization. If classification were a subquery or part of a large query,
it would not be possible to obtain all join reorderings, thereby risking suboptimal
execution.

Current SQL standards are mainly designed for efficient OLTP (On-Line
Transactional Processing) queries. For non-OLTP applications, it is true that we can
usually reformulate the problem and express the solution using standard SQL.
However, this approach often results in inefficiency. Extending current SQL with
ad-hoc constructs and new optimization considerations might solve this problem
in some particular domain, but it is not a satisfactory solution. Since supporting
OLAP (On-Line Analytical Processing) applications efficiently is such an important
goal for today's RDBMSs, the problem deserves a more general solution.

In [23] an extension of SQL, called SchemaSQL, is proposed. SchemaSQL offers
the capability of uniform manipulation of data and meta-data in relational
multi-database systems. By examining the SQL-like queries in Section 3.1, we can see
that this capability is what we need in the MIND algorithm. To show the power of
extended SQL and the flexibility and general flavor of MIND, in this section, we
rewrite all the queries in Section 3.1 using SchemaSQL.

First we give an overview of the syntax of SchemaSQL. For more details see
[23].

In a standard SQL query, the tuple variables are declared in the FROM clause. A
variable declaration has the form <range> <var>. For example, in the query below,
the expression student T declares T as a variable that ranges over the (tuples of
the) relation student(student_id, department, GPA):

    SELECT student_id
    FROM student T
    WHERE T.department = 'CS' AND T.GPA = 'A'

The SchemaSQL syntax extends SQL syntax in several directions:

1. The federation consists of databases, with each database consisting of
relations.

2. To permit meta-data queries and reconstruction views, SchemaSQL permits
the declaration of other types of variables in addition to the tuple variables
permitted in SQL.

3. Aggregate operations are generalized in SchemaSQL to make horizontal and
block aggregations possible, in addition to the usual vertical aggregation in
SQL.

SchemaSQL permits the declaration of variables that can range over any of the
following five sets:

1. names of databases in a federation,

2. names of the relations in a database,

3. names of the columns in the scheme of a relation,

4. tuples in a given relation in a database, and

5. values appearing in a column corresponding to a given column in a relation.

Variable declarations follow the same syntax <range> <var> as in SQL, where var
is any identifier. However, there are two major differences:

1. The only kind of range permitted in SQL is a set of tuples in some relation
in the database, whereas in SchemaSQL any of the five kinds of range can be
used to declare variables.

2. The range specification in SQL is made using a constant, i.e., an identifier
referring to a specific relation in a database. By contrast, the diversity of
ranges possible in SchemaSQL permits range specifications to be nested, in
the sense that it is possible to say, for example, that R is a variable ranging
over the relation names in database D, and that T is a tuple in the relation
denoted by R.

Range specifications are one of the following five types of expressions, where db,
rel, col are any constant or variable identifiers.

1. The expression -> denotes a range corresponding to the set of database names
in the federation.

2. The expression db-> denotes the set of relation names in the database db.

3. The expression db::rel-> denotes the set of names of columns in the schema
of the relation rel in the database db.

4. db::rel denotes the set of tuples in the relation rel in the database db.

5. db::rel.col denotes the set of values appearing in the column named col in
the relation rel in the database db.

For example, consider the clause FROM db1-> R, db1::R T. It declares R as
a variable ranging over the set of relation names in the database db1 and T as a
variable ranging over the tuples in each relation R in the database db1.

Now we are ready to rewrite all the SQL-like queries in Section 3.1 using
SchemaSQL. Assume that our training set is stored in relation DETAIL in a
database named FACT. We first generate all the dimension tables with the schema
(leaf_num, class, attr_val, count) in a database named DIMENSION, using a
simple SchemaSQL statement:

    CREATE VIEW DIMENSION::R(leaf_num, class, attr_val, count) AS
    SELECT T.leaf_num, T.class, T.R, COUNT(*)
    FROM FACT::DETAIL-> R,
         FACT::DETAIL T
    WHERE R <> 'class' AND
          R <> 'leaf_num' AND
          T.leaf_num <> STOP
    GROUP BY T.leaf_num, T.class, T.R

The variable R is declared as a column name variable ranging over the column
names of relation DETAIL in the database FACT, and the variable T is declared as
a tuple variable on the same relation. The conditions on R in the WHERE clause make
the variable R range over all columns except the columns named class and leaf_num.
If there are n columns in DETAIL (excluding columns class and leaf_num), this
query generates n VIEWs in database DIMENSION, and the name of each VIEW is the
same as the corresponding column name in DETAIL. Note that the attribute name
to relation name transformation is done in a very natural way, and the formation
of multiple GROUP BYs is done by involving DETAIL only once.

Those views will be materialized, so that in the later operations we do not need
to access DETAIL any more.
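For comparison, the following Python sketch emulates the column-name variable R in standard SQL by looping over the schema with the standard-library `sqlite3` module. It is an illustration only: it stores leaf_num as an explicit column, omits the STOP filter, and issues one GROUP BY query per attribute, so unlike the SchemaSQL statement it scans DETAIL once per attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DETAIL (attr_1 INT, attr_2 INT, class TEXT, leaf_num INT)")
conn.executemany(
    "INSERT INTO DETAIL VALUES (?, ?, ?, 0)",
    [(65, 30, "Safe"), (15, 23, "Risky"), (75, 40, "Safe"), (15, 28, "Risky"),
     (100, 55, "Safe"), (60, 45, "Safe"), (62, 30, "Risky")],
)

# The column names play the role of the SchemaSQL variable R.
columns = [row[1] for row in conn.execute("PRAGMA table_info(DETAIL)")]
attributes = [c for c in columns if c not in ("class", "leaf_num")]

dimension = {}
for attr in attributes:
    # attr comes from the schema itself, so interpolating it is safe here.
    dimension[attr] = conn.execute(
        f"SELECT leaf_num, class, {attr}, COUNT(*) FROM DETAIL "
        f"GROUP BY leaf_num, class, {attr} ORDER BY {attr}, class"
    ).fetchall()
```

Each entry of `dimension` corresponds to one view in database DIMENSION, named after its attribute.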

Relations corresponding to UP with the schema (leaf_num, attr_val, class,
count) can be generated in a database named UP by performing a self-outer-join
on dimension tables in database DIMENSION:

    CREATE VIEW UP::R(leaf_num, attr_val, class, count) AS
    SELECT d_1.leaf_num, d_1.attr_val, d_1.class, SUM(d_2.count)
    FROM (FULL OUTER JOIN DIMENSION::R d_1,
                          DIMENSION::R d_2,
                          DIMENSION-> R
          ON d_1.leaf_num = d_2.leaf_num AND
             d_1.attr_val < d_2.attr_val AND
             d_1.class = d_2.class
          GROUP BY d_1.leaf_num, d_1.attr_val, d_1.class)

The variable R is declared as a relation name variable ranging over all the
relations in database DIMENSION. Variables d_1 and d_2 are both tuple variables
over the tuples in each relation R in database DIMENSION. For each relation in
database DIMENSION, a self-outer-join is performed according to the conditions
specified in the query, and the result is put into a VIEW with the same name in
database UP.

Similarly, relations corresponding to DOWN can be generated in a database
named DOWN by just changing the < to > in the ON clause.
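The content of the UP and DOWN relations can be sketched in Python as cumulative class counts over one dimension table. The comparison directions below (value <= attr_val for UP, value > attr_val for DOWN) are chosen so that the result reproduces Tables 5 and 6; the function name and the dictionary layout are hypothetical.

```python
def up_and_down(dim, classes=(1, 2)):
    """dim: rows (leaf_num, attr_val, class, count) of one leaf.

    Returns two dicts mapping attr_val to a list of per-class counts:
    records with value <= attr_val (UP) and with value > attr_val (DOWN).
    """
    values = sorted({v for _, v, _, _ in dim})
    up = {v: [sum(n for _, x, c, n in dim if x <= v and c == k) for k in classes]
          for v in values}
    total = up[values[-1]]                      # counts over the whole leaf
    down = {v: [t - u for t, u in zip(total, up[v])] for v in values[:-1]}
    return up, down


# DIM_1 from Table 3.
DIM_1 = [(0, 15, 2, 2), (0, 60, 1, 1), (0, 62, 2, 1),
         (0, 65, 1, 1), (0, 75, 1, 1), (0, 100, 1, 1)]
up, down = up_and_down(DIM_1)
```

Since the dimension table is small, both relations can be derived from it without touching DETAIL again.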

Database DOWN and database UP contain all the information we need to
compute all the gini index values. Since standard SQL only allows vertical aggregations,
we need to rearrange them before the gini index is actually calculated as in
Section 3.1. In SchemaSQL, aggregation operations are generalized to make horizontal
and block aggregations possible. Thus, we can generate views that contain all gini
index values at each possible split point for each attribute in a database named
GINI_VALUE directly from relations in UP and DOWN:

    CREATE VIEW GINI_VALUE::R(leaf_num, attr_val, gini) AS
    SELECT u.leaf_num, u.attr_val, f_gini
    FROM UP::R u,
         DOWN::R d,
         UP-> R
    WHERE u.leaf_num = d.leaf_num AND
          u.attr_val = d.attr_val
    GROUP BY u.leaf_num, u.attr_val

where f_gini is a function of u.class, d.class, u.count, d.count according to (1)
and (2).

R is declared as a variable ranging over the set of relation names in database
UP, u is a variable ranging over the tuples in each relation in database UP, and
d is a variable ranging over the tuples in the relation with the same name as R in
database DOWN. Note that the sets of relation names in databases UP and DOWN
are the same. For each of the relation pairs with the same name in UP and DOWN,
this statement will create a view with the same name in database GINI_VALUE
according to the conditions specified. It is interesting to note that f_gini is a block
aggregation function instead of the usual vertical aggregation function in SQL.
Each view named R in database GINI_VALUE contains the gini index value at
each possible split point with respect to the attribute named R.

Next, we create a single view MIN_GINI with the schema MIN_GINI(leaf_num,
attr_name, attr_val, gini) in a database named SPLIT from the multiple views in
database GINI_VALUE:

    CREATE VIEW SPLIT::MIN_GINI(leaf_num, attr_name, attr_val, gini) AS
    SELECT T_1.leaf_num, R_1, T_1.attr_val, gini
    FROM GINI_VALUE-> R_1,
         GINI_VALUE::R_1 T_1
    WHERE T_1.gini = (SELECT MIN(T_2.gini)
                      FROM GINI_VALUE-> R_2,
                           GINI_VALUE::R_2 T_2
                      WHERE R_1 = R_2 AND
                            T_1.leaf_num = T_2.leaf_num)

R_1 and R_2 are variables ranging over the set of relation names in database
GINI_VALUE. T_1 and T_2 are tuple variables ranging over the tuples in relations
specified by R_1 and R_2, respectively. The clause R_1 = R_2 enforces R_1 and R_2
to be the same relation. Note that relation name R_1 in database GINI_VALUE
becomes the column value for the column named attr_name in relation MIN_GINI
in database SPLIT. Relation MIN_GINI now contains the best split value and the
corresponding gini index value for each leaf node of the tree with respect to all
attributes.

The overall best split for each leaf is obtained by executing the following:

CREATE VIEW SPLIT::BEST_SPLIT(leaf_num, attr_name, attr_val) AS
SELECT T1.leaf_num, T1.attr_name, T1.attr_val
FROM SPLIT::MIN_GINI T1
WHERE T1.gini = (SELECT MIN(T2.gini)
                 FROM SPLIT::MIN_GINI T2
                 WHERE T1.leaf_num = T2.leaf_num)

This statement is similar to the statement generating relation BEST_SPLIT in
Section 3.1. T1 is declared as a tuple variable ranging over the tuples of relation
MIN_GINI in database SPLIT. For each leaf_num, the (attr_name, attr_val) pair that
achieves the minimum gini index value is inserted into relation BEST_SPLIT.
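The selection performed by the last view can be sketched outside the database as a single pass that keeps the smallest gini value seen per leaf. This is only an illustration of the logic, not the paper's implementation; the function name and the sample rows are hypothetical, and unlike the SQL view (which returns every tied row) the sketch keeps the first minimum it meets.

```python
def best_splits(min_gini_rows):
    """Pick, for every leaf, the row with the smallest gini value.

    min_gini_rows: iterable of (leaf_num, attr_name, attr_val, gini) tuples,
    mirroring the schema of MIN_GINI.
    Returns {leaf_num: (attr_name, attr_val)}.
    """
    best = {}
    for leaf_num, attr_name, attr_val, g in min_gini_rows:
        if leaf_num not in best or g < best[leaf_num][0]:
            best[leaf_num] = (g, attr_name, attr_val)
    return {leaf: (name, val) for leaf, (_, name, val) in best.items()}


# Hypothetical contents of MIN_GINI for a tree with two leaves.
rows = [
    (0, "age", 40, 0.31),
    (0, "salary", 75000, 0.22),
    (1, "age", 60, 0.10),
    (1, "salary", 50000, 0.45),
]
splits = best_splits(rows)
```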

We have shown how to rewrite all the SQL-like queries in the MIND algorithm using
SchemaSQL. In our current prototype of MIND, the first step, generating all the
dimension tables from DETAIL, is the most costly, and all the later steps only need to
access small dimension tables. We use a udf to reduce the cost of the first step. All
the SQL-like queries in Section 3.1 in the later steps are translated into equivalent
SQL queries. Those translations usually lead to poor performance. But since those
queries only access small relations in MIND, the performance loss is negligible.
While the udf provides a solution for our classification algorithm, we believe a general
extension of SQL is needed for efficient support of OLAP applications.

An alternative way to generate all the dimension tables from DETAIL would
be to use the newly proposed data cube operator [14], since dimension tables are
different subcubes. But it usually takes a long time to generate the data cube
without precomputation, and the fact that the leaf_num column in DETAIL keeps
changing from level to level as we grow the tree makes precomputation infeasible.

7. Experimental results 

There are two important metrics to evaluate the quality of a classifier: classification
accuracy and classification time. We compare our results with those of SLIQ [24]
and SPRINT [29]. (For brevity, we include only SPRINT in this paper; comparisons
showing the improvement of SPRINT over SLIQ are given in [29].) Unlike SLIQ and
SPRINT, we use the classical database methodology of summarization. Like SLIQ
and SPRINT, we use the same metric (gini index) to choose the best split for each
node, we grow our tree in a breadth-first fashion, and we prune it using the same
pruning algorithm. Our classifier therefore generates a decision tree identical to the
one produced by [24, 29] for the same training set, which facilitates meaningful
comparisons of run time. The accuracy of SPRINT and SLIQ is discussed in [24, 29],
where it is argued that the accuracy is sufficient.

For our scaling experiments, we ran our prototype on large data sets. The main
cost of our algorithm is that we need to access DETAIL n times (n is the number
of attributes) for each level of the tree growth due to the absence of the multiple
GROUP BY operator in the current SQL standard. We recommend that future DBMSs
support the multiple GROUP BY operator so that DETAIL will be accessed only once
regardless of the number of attributes. In our current working prototype, this is done
by using a user-defined function, as we described in Section 3.1.
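The effect of a multiple GROUP BY (or of the user-defined function that stands in for it) can be sketched as one scan of DETAIL that feeds a separate count table per attribute. The row layout, the column names `leaf_num` and `class`, and the function name below are assumptions made for illustration; they are not taken from the prototype.

```python
from collections import defaultdict


def dimension_tables(detail_rows, attributes, class_col="class", leaf_col="leaf_num"):
    """Build one count table per attribute in a single pass over DETAIL.

    Each table maps (leaf_num, attribute value, class) -> count, which is the
    result of a GROUP BY on those three columns with COUNT(*).
    """
    tables = {a: defaultdict(int) for a in attributes}
    for row in detail_rows:              # DETAIL is read exactly once
        for a in attributes:             # every grouping is updated from the same row
            key = (row[leaf_col], row[a], row[class_col])
            tables[a][key] += 1
    return {a: dict(t) for a, t in tables.items()}


# Hypothetical three-row DETAIL table with two attributes.
detail = [
    {"leaf_num": 0, "age": 30, "salary": 60, "class": "A"},
    {"leaf_num": 0, "age": 30, "salary": 90, "class": "B"},
    {"leaf_num": 0, "age": 50, "salary": 60, "class": "A"},
]
dims = dimension_tables(detail, ["age", "salary"])
```

Issuing one ordinary GROUP BY query per attribute would instead read DETAIL once for each of the n attributes, which is the cost the recommendation above aims to remove.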

Owing to the lack of a classification benchmark, we used the synthetic database 
proposed in [1]. In this synthetic database, each record consists of nine attributes as 
shown in Table 17. Ten classifier functions are proposed in [1] to produce databases 
with different complexities. We run our prototype using function 2. It generates 
a database with two classes: Group A and Group B. The description of the class 
predicate for Group A is shown below. 

Function 2, Group A

((age < 40) ∧ (50K < salary < 100K)) ∨
((40 < age < 60) ∧ (75K < salary < 125K)) ∨
((age > 60) ∧ (25K < salary < 75K))
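For readers who want to generate labels themselves, the predicate can be written as a small function. The sketch below uses the bounds exactly as printed above (strict inequalities); the function name is hypothetical, and records that fail the predicate are taken to belong to Group B because the database has only two classes.

```python
def is_group_a(age, salary):
    """Function 2 class predicate for Group A, with the bounds as printed."""
    return (
        (age < 40 and 50_000 < salary < 100_000)
        or (40 < age < 60 and 75_000 < salary < 125_000)
        or (age > 60 and 25_000 < salary < 75_000)
    )


def label(age, salary):
    """Return the class label of one synthetic record."""
    return "Group A" if is_group_a(age, salary) else "Group B"
```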

Our experiments were conducted on an IBM RS/6000 workstation running AIX
level 4.1.3 and DB2 version 2.1.1. We used training sets with sizes ranging from
0.5 million to 5 million records. The relative response time and response time per
example are shown in Figure 5 and Figure 6, respectively. Figure 5 hints that our
algorithm achieves linear scalability with respect to the training set size. Figure 6
shows that the time-per-example curve stays flat as the training set size increases.
The corresponding curve for [29] appears to be growing slightly on the largest cases.
Figure 7 is the performance comparison between MIND and SPRINT. MIND ran
on a processor with a slightly slower clock rate. We can see that MIND performs
better than SPRINT does even in the range where SPRINT scales well, and MIND
continues to scale well as the data sets get larger.

We also ran MIND on an IBM multiprocessor SP2 computer system. Figure 8
shows the parallel speedup of MIND.

attribute     value
salary        uniformly distributed from 20K to 150K
commission    salary > 75K => commission = 0, else
              uniformly distributed from 10K to 75K
age           uniformly distributed from 20 to 80
loan          uniformly distributed from 0 to 500K
elevel        uniformly chosen from 0 to 4
car           uniformly chosen from 1 to 20
zipcode       uniformly chosen from 10 available zipcodes
hvalue        uniformly distributed from 0.5 × k × 100000 to 1.5 × k × 100000,
              where k ∈ {0, ..., 9} is zipcode
hyear         uniformly distributed from 1 to 30

Table 17: Description of the synthetic data



Figure 5: Relative total response time (x-axis: training set size, in millions). The
y-value denotes the total response time for the indicated training set size, divided
by the total response time for 5 million examples.


Another interesting measurement we obtained from uniprocessor execution is
that accessing DETAIL to form the dimension tables for all attributes takes 93%-
96% of the total execution time. To achieve linear speedup on multiprocessors, it
is critical that this step be parallelized. In the current working prototype of MIND,
this is done by a user-defined function with a scratch-pad accessible from multiple
processors.

8. Conclusions 

The MIND algorithm solves the problem of classification within relational
database management systems. Our performance measurements show that MIND
demonstrates scalability with respect to the number of examples in training sets
and the number of parallel processors. We believe MIND is the first classifier to
successfully run on datasets of N = 5 million examples on a uniprocessor and
yet demonstrate effectively non-increasing response time per example as a function
of N. It also runs faster than previous algorithms on file systems.

Figure 6: Relative response time per example (x-axis: training set size, in millions).
The y-value denotes the response time per example for the indicated training set
size, divided by the response time per example when processing 5 million examples.

Figure 7: Performance comparison of MIND and SPRINT

There are four reasons why MIND is fast, exhibits excellent scalability, and is
able to handle data sets larger than those tackled before:

1. MIND rephrases the data mining function classification as a classic DBMS
problem of summarization and analysis thereof.

2. MIND avoids any update to the DETAIL table of examples. This is of significant
practical interest; for example, imagine DETAIL having billions of rows.

3. In the absence of a multiple concurrent grouping SQL operator, MIND takes
advantage of the user-defined function capability of DB2 to achieve the equivalent
functionality and the resultant performance gain.

4. Parallelism of MIND is obtained at little or no extra cost because the RDBMS
parallelizes SQL queries.

Figure 8: Speedup of MIND for multiprocessors. The y-value denotes the total
response time for the indicated training set size, divided by the total response time
for 3 million examples.

We recommend that extensions be made to SQL to do multiple groupings and 
the streaming of each group to different relations. Most DBMS operators currently 
take two streams of data (tables) and combine them into one. We believe that we 
have shown the value of an operator that takes a single stream input and produces 
multiple streams of outputs.

A simple proof of a condition for
cointegration

T. W. Anderson¹,*

Stanford University

Abstract: A simple proof is given for a theorem concerning the first differ- 
ence and some linear functions of a cointegrated autoregressive process being 
stationary. 


1. Introduction 

Many macroeconometric models are formulated in terms of autoregressive processes
or autoregressive processes with moving average innovations. The most appropriate
process in a given situation may not be stationary, but some linear relations of
the components may be stationary; such a process is called cointegrated. Johansen
(1995) has given alternative conditions for the cointegrated components and first
differences of other components to be stationary. Here we give a proof of one
condition that is more straightforward and transparent than what is in the literature.
A $p$-dimensional $m$-order autoregressive process $\{Y_t\}$ is defined by

\[ Y_t = B_1 Y_{t-1} + B_2 Y_{t-2} + \cdots + B_m Y_{t-m} + Z_t, \tag{1.1} \]

where the $Z_t$'s are independent unobservable innovations with $\mathcal{E} Z_t = 0$, $\mathcal{E} Z_t Z_t' = \Sigma$,
and $\mathcal{E} Z_t Y_{t-s}' = 0$, $0 < s$. Let

\[ B(\lambda) = \lambda^m I_p - \lambda^{m-1} B_1 - \cdots - B_m, \tag{1.2} \]

and let the roots of $|B(\lambda)| = 0$ be $\lambda_i$, $i = 1, \dots, mp$. If $|\lambda_i| < 1$, $i = 1, \dots, mp$,
the process $\{Y_t\}$ may be stationary. If one or more of the roots are 1, the process
is nonstationary, but some order of differencing may yield a stationary process.
When some linear functions of a nonstationary process are stationary, the model
is called cointegrated. We call a process defined by Equation (1.1) stationary
if it is possible to assign a distribution to $(Y_{-m+1}, \dots, Y_{-1}, Y_0)$ such that (1.1)
generates a process $Y_{-m+1}, Y_{-m+2}, \dots$ that is stationary. Throughout this paper
it is assumed that $n$ of the roots are 1 and the other roots satisfy $|\lambda_i| < 1$,
$i = n+1, \dots, mp$.

An “error-correction form” of the autoregressive process is

\[ \Delta Y_t = \Pi Y_{t-1} + \Pi_1 \Delta Y_{t-1} + \cdots + \Pi_{m-1} \Delta Y_{t-m+1} + Z_t, \tag{1.3} \]

where $\Delta Y_t = Y_t - Y_{t-1}$,

\[ \Pi_j = -(B_{j+1} + \cdots + B_m), \quad j = 1, \dots, m-1, \tag{1.4} \]

\[ \Pi = B_1 + B_2 + \cdots + B_m - I_p. \tag{1.5} \]

Note that $\Pi_j = \Pi_{j+1} - B_{j+1}$ and $\Pi = -B(1)$.
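The mapping from the autoregressive coefficients to the error-correction coefficients is purely mechanical, and a short sketch may help in checking small cases by hand. It uses plain nested lists for matrices; the helper names and the numerical values of the three coefficient matrices are hypothetical and chosen only so that the arithmetic is exact.

```python
def mat_add(X, Y):
    """Entrywise sum of two equally sized matrices (nested lists)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_scale(c, X):
    """Scalar multiple c * X."""
    return [[c * x for x in row] for row in X]


def identity(p):
    return [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]


def error_correction(B):
    """Given B = [B_1, ..., B_m], return (Pi, [Pi_1, ..., Pi_{m-1}]).

    Pi   = B_1 + ... + B_m - I_p
    Pi_j = -(B_{j+1} + ... + B_m),  j = 1, ..., m-1
    """
    m, p = len(B), len(B[0])
    total = [[0.0] * p for _ in range(p)]
    for Bj in B:
        total = mat_add(total, Bj)
    Pi = mat_add(total, mat_scale(-1.0, identity(p)))
    Pis = []
    for j in range(1, m):                 # j = 1, ..., m-1
        tail = [[0.0] * p for _ in range(p)]
        for Bi in B[j:]:                  # B[j] is B_{j+1} (0-based list)
            tail = mat_add(tail, Bi)
        Pis.append(mat_scale(-1.0, tail))
    return Pi, Pis


# Hypothetical coefficients for p = 2, m = 3.
B1 = [[0.5, 0.25], [0.0, 0.5]]
B2 = [[0.25, 0.0], [0.25, 0.25]]
B3 = [[0.125, 0.0], [0.0, 0.125]]
Pi, Pis = error_correction([B1, B2, B3])
```

The recursion noted in the text, with each coefficient obtained from the next one by subtracting one autoregressive matrix, can be confirmed on the returned list.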

*This paper is dedicated to my friend and co-author Herman Rubin, who stimulated and 
educated me as well as collaborated with me. 

¹Departments of Economics and Statistics, Stanford University, Stanford, CA 94305, USA.
e-mail: twa@stat.stanford.edu

Keywords and phrases: autoregressive process, error correction form, stationarity.

Suppose the rank of $\Pi$ is $k$. Then $\Pi$ can be written $\Pi = A\Gamma'$, where $A$ and $\Gamma$
are $p \times k$ matrices of rank $k$. Let $A_\perp$ and $\Gamma_\perp$ be $p \times (p-k)$ matrices of rank $p-k$
such that $A_\perp' A = 0$ and $\Gamma_\perp' \Gamma = 0$. Then a necessary and sufficient condition that
$\Delta Y_t$ and $\Gamma' Y_t$ are stationary is that

\[ \operatorname{rank}\Bigl[ A_\perp' \Bigl( I_p - \sum_{j=1}^{m-1} \Pi_j \Bigr) \Gamma_\perp \Bigr] = p - k \tag{1.6} \]

[Theorem 4.2, Johansen (1995)]. The proof of this statement involves an expansion
of $B(\lambda)$ around $\lambda = 1$.

If $\{Y_t\}$ is stationary, it is said to be $I(0)$. If $\{Y_t\}$ is not $I(0)$, but $\{\Delta Y_t\}$ is
stationary, the process $\{Y_t\}$ is said to be $I(1)$.

Corollary 4.3 of Johansen asserts that if $k$ is the rank of $\Pi$ and $k < p$, then the
multiplicity of $\lambda = 1$ as a zero of $|B(\lambda)|$ is equal to $p - k$ if and only if $\{Y_t\}$ is $I(1)$.
The proof of this statement depends on his Theorem 4.2 and its proof.

In this paper the condition is formulated as

Rank Condition. There are $n$ linearly independent solutions to

\[ \omega' \Pi = 0, \tag{1.7} \]

where $n$ is the multiplicity of $\lambda = 1$ as a root of the characteristic equation
$|B(\lambda)| = 0$.

Let $n$ independent solutions of (1.7) be assembled into the matrix $\Omega_1 =
(\omega_1, \dots, \omega_n)$; then $\Omega_1' \Pi = 0$ and the rank of $\Omega_1$ is $n$.

2. First-order case 

First we treat the special case of $m = 1$. Then (1.1) is

\[ Y_t = B_1 Y_{t-1} + Z_t; \tag{2.1} \]

the error-correction form is

\[ \Delta Y_t = \Pi Y_{t-1} + Z_t, \tag{2.2} \]

where $\Pi = B_1 - I_p$; and $B(\lambda) = \lambda I_p - B_1$.

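Theorem 1 ($m = 1$). Suppose the Rank Condition holds. Then the rank of $\Pi$ is
$k = p - n$, and there exists a $p \times k$ matrix $\Omega_2$ such that

\[ \Omega_2' \Pi = \Upsilon_{22} \Omega_2', \tag{2.3} \]

$\Upsilon_{22}$ ($k \times k$) is nonsingular, and $\Omega = (\Omega_1, \Omega_2)$ is nonsingular. Define

\[ X_t = \begin{pmatrix} X_{1t} \\ X_{2t} \end{pmatrix} = \begin{pmatrix} \Omega_1' Y_t \\ \Omega_2' Y_t \end{pmatrix}, \qquad W_t = \begin{pmatrix} W_{1t} \\ W_{2t} \end{pmatrix} = \begin{pmatrix} \Omega_1' Z_t \\ \Omega_2' Z_t \end{pmatrix}. \tag{2.4} \]

Then $(\Delta X_{1t}, X_{2t})$ defines a stationary process.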

Proof. Let $\Omega_1' = (I_n, \Omega_{21}')$ and $\Pi' = (\Pi_1', \Pi_2')$, where $\Pi_2$ is $k \times p$. (The rows of $\Omega_1$
and the columns of $\Pi$ can be ordered so that $\Omega_{11}$ is nonsingular and can be set
as $I_n$.) Then the Rank Condition is

\[ 0 = \Omega_1' \Pi = (I_n, \Omega_{21}') \begin{pmatrix} \Pi_1 \\ \Pi_2 \end{pmatrix} = \Pi_1 + \Omega_{21}' \Pi_2, \tag{2.5} \]

which implies $\Pi_1 = -\Omega_{21}' \Pi_2$ and

\[ \Pi = \begin{pmatrix} -\Omega_{21}' \\ I_k \end{pmatrix} \Pi_2. \tag{2.6} \]

Define $\Omega_2 = \Pi_2'$ ($p \times k$) and

\[ \Upsilon_{22} = \Pi_2 \begin{pmatrix} -\Omega_{21}' \\ I_k \end{pmatrix}. \tag{2.7} \]

Then (2.3) is satisfied. Note that $\Upsilon_{22}$ ($k \times k$) is nonsingular, that is, of rank $k$,
because if $\Upsilon_{22}$ were singular there would exist a $k$-vector $\gamma$ such that $\gamma' \Upsilon_{22} = 0$
and then $\gamma' \Pi_2$ would be another left-sided eigenvector of $\Pi$ associated with the
root 0, but that would imply more than $n$ linearly independent vectors satisfying
$\omega' \Pi = 0$ and hence more than $n$ zeros of $|B(\lambda)|$ at $\lambda = 1$, which is contrary to
assumption. Note that (2.6) is a factorization $\Pi = A\Gamma'$ with $\Gamma' = \Pi_2$.

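The matrix $\Omega$ satisfies

\[ \Omega' \Pi = \begin{pmatrix} \Omega_1' \\ \Omega_2' \end{pmatrix} \Pi = \begin{pmatrix} 0 \\ \Upsilon_{22} \Omega_2' \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & \Upsilon_{22} \end{pmatrix} \begin{pmatrix} \Omega_1' \\ \Omega_2' \end{pmatrix} = \Upsilon \Omega', \tag{2.8} \]

\[ \Omega' B_1 = \Omega' (\Pi + I_p) = \begin{pmatrix} I_n & 0 \\ 0 & \Phi_{22} \end{pmatrix} \Omega' = \Phi \Omega', \tag{2.9} \]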

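where $\Phi_{22} = \Upsilon_{22} + I_k$. Let $\Pi_2 = (\Pi_{21}, \Pi_{22})$. Then $\Omega$ is nonsingular because

\[ |\Omega'| = \begin{vmatrix} I_n & \Omega_{21}' \\ \Pi_{21} & \Pi_{22} \end{vmatrix} = \begin{vmatrix} I_n & 0 \\ -\Pi_{21} & I_k \end{vmatrix} \cdot \begin{vmatrix} I_n & \Omega_{21}' \\ \Pi_{21} & \Pi_{22} \end{vmatrix} = \begin{vmatrix} I_n & \Omega_{21}' \\ 0 & \Upsilon_{22} \end{vmatrix} = |\Upsilon_{22}| \neq 0. \tag{2.10} \]

Hence (2.4) is a nonsingular linear transformation.

The transformed process $X_t$ satisfies the autoregressive model

\[ X_t = \Phi X_{t-1} + W_t, \tag{2.11} \]

\[ \Delta X_t = \Upsilon X_{t-1} + W_t, \tag{2.12} \]

where

\[ \Phi = \Omega' B_1 (\Omega')^{-1} = \begin{pmatrix} I_n & 0 \\ 0 & \Phi_{22} \end{pmatrix} \tag{2.13} \]

has eigenvalues $\lambda_i$, $i = 1, \dots, p$, and $\Phi_{22}$ has eigenvalues $\lambda_i$, $i = n+1, \dots, p$, and
$\Upsilon = \Phi - I_p$. From (2.11) to (2.13) we obtain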


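\[ \begin{pmatrix} \Delta X_{1t} \\ X_{2t} \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & \Phi_{22} \end{pmatrix} \begin{pmatrix} \Delta X_{1,t-1} \\ X_{2,t-1} \end{pmatrix} + \begin{pmatrix} W_{1t} \\ W_{2t} \end{pmatrix} = \begin{pmatrix} W_{1t} \\ \Phi_{22} X_{2,t-1} + W_{2t} \end{pmatrix} \tag{2.14} \]

as generating the process $(\Delta X_{1t}', X_{2t}')'$. Since the eigenvalues of the coefficient
matrix in (2.14) are 0 of multiplicity $n$ and $\lambda_i$, $i = n+1, \dots, p$, the process
$(\Delta X_{1t}', X_{2t}')'$ is a stationary process. □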
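The construction in the proof can be checked numerically in the smallest nontrivial case, $p = 2$, $n = 1$, $k = 1$. The sketch below builds $\Omega'$ with first row $(1, w)$ solving the Rank Condition and second row $\Pi_2$, and then forms $\Omega' B_1 (\Omega')^{-1}$, which should be block diagonal with a unit entry and the stable root. The function names and the coefficient matrix are hypothetical, and the code assumes that the last row of $\Pi$ is not zero.

```python
def matmul(X, Y):
    """Product of two matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def inv2(X):
    """Inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = X
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def transform_first_order(B1):
    """Bivariate case of Theorem 1 with one unit root.

    Returns (Omega_t, Phi) where Omega_t stands for Omega' and
    Phi = Omega' B1 (Omega')^{-1}.
    """
    Pi = [[B1[0][0] - 1.0, B1[0][1]], [B1[1][0], B1[1][1] - 1.0]]
    Pi1, Pi2 = Pi                       # first n = 1 row and last k = 1 row of Pi
    # Omega_1' = (1, w) must satisfy Pi1 + w * Pi2 = 0 (Rank Condition).
    j = 0 if Pi2[0] != 0 else 1
    w = -Pi1[j] / Pi2[j]
    Omega_t = [[1.0, w], Pi2]           # rows: Omega_1' and Omega_2' = Pi_2
    Phi = matmul(matmul(Omega_t, B1), inv2(Omega_t))
    return Omega_t, Phi


# Hypothetical coefficient matrix with characteristic roots 1 and 0.5.
B1 = [[0.8, 0.2], [0.3, 0.7]]
Omega_t, Phi = transform_first_order(B1)
```

In this example the first coordinate of the transformed process is a random walk and the second is a stable first-order autoregression, which is the decomposition the theorem describes.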

The transformation $X_t = \Omega' Y_t$ is a change of coordinates such that the first
$n$ coordinates of $X_t$ define a random walk, which is an $I(1)$ process. The other $k$
coordinates define a stationary process. Thus $\{X_t\}$ is an $I(1)$ process; that is, $\Delta X_t$
is an $I(0)$ process. The process $Y_t = (\Omega')^{-1} X_t$ is a mixture of an $I(1)$ and an $I(0)$
process.

3. General case

Theorem 2. When the Rank Condition holds,

\[ \begin{pmatrix} \Delta Y_t \\ \Pi_{\cdot k} Y_t \end{pmatrix} \tag{3.1} \]

defines a stationary process.


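Proof. For arbitrary $m$ the models (1.1) and (1.3) can be written in “stacked” form
(stacked quantities are marked here with a tilde) as

\[ \tilde Y_t = \tilde B_1 \tilde Y_{t-1} + \tilde Z_t \tag{3.2} \]

and

\[ \Delta \tilde Y_t = \tilde\Pi \tilde Y_{t-1} + \tilde Z_t, \tag{3.3} \]

where

\[ \tilde Y_t = \begin{pmatrix} Y_t \\ Y_{t-1} \\ Y_{t-2} \\ \vdots \\ Y_{t-m+1} \end{pmatrix}, \quad \tilde Z_t = \begin{pmatrix} Z_t \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad \tilde B_1 = \begin{pmatrix} B_1 & B_2 & \cdots & B_{m-1} & B_m \\ I_p & 0 & \cdots & 0 & 0 \\ 0 & I_p & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & I_p & 0 \end{pmatrix}, \tag{3.4} \]

and $\tilde\Pi = \tilde B_1 - I_{mp}$. [See Anderson (1971), Section 5.3, for example.] Let $\tilde B(\lambda) =
\lambda I_{mp} - \tilde B_1$. Then $|\tilde B(\lambda)| = |B(\lambda)|$. We shall prove Theorem 2 by using Theorem 1
with $Y_t$ replaced by $\tilde Y_t$.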

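Suppose that there are $n$ linearly independent solutions to $\tilde\omega' \tilde\Pi = 0$. Let these
solutions be assembled into the $n \times mp$ matrix $\tilde\Omega_1' = (\Omega_{11}', \dots, \Omega_{m1}')$. Then

\[ 0 = \tilde\Omega_1' \tilde\Pi = \bigl[ \Omega_{11}'(B_1 - I_p) + \Omega_{21}',\; \Omega_{11}' B_2 - \Omega_{21}' + \Omega_{31}',\; \dots,\; \Omega_{11}' B_{m-1} - \Omega_{m-1,1}' + \Omega_{m1}',\; \Omega_{11}' B_m - \Omega_{m1}' \bigr]. \tag{3.5} \]

This equation implies

\[ \Omega_{m1}' = \Omega_{11}' B_m, \tag{3.6} \]

\[ \Omega_{m-j,1}' = \Omega_{11}' B_{m-j} + \Omega_{m-j+1,1}', \quad j = 1, \dots, m-1, \tag{3.7} \]

\[ 0 = \Omega_{11}'(B_1 - I_p) + \Omega_{21}' = \Omega_{11}' \Pi. \tag{3.8} \]

It follows that

\[ \tilde\Omega_1' = \Omega_{11}' [I_p, -\Pi_1, \dots, -\Pi_{m-1}]. \tag{3.9} \]

Lemma. There is a $pm \times n$ matrix $\tilde\Omega_1$ of rank $n$ such that $\tilde\Omega_1' \tilde\Pi = 0$ if and only
if there is a $p \times n$ matrix $\Omega_{11}$ of rank $n$ such that $\Omega_{11}' \Pi = 0$.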

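Thus the Rank Condition on the $mp$-dimensional $\tilde Y_t$ in terms of $\tilde\Pi$ is equivalent
to the Rank Condition on $Y_t$, where $\Pi$ is defined by (1.5).

It follows from Theorem 1 applied to (3.2) that the rank of $\tilde\Pi$ is $K = mp - n$.
Let

\[ \tilde\Pi = \begin{pmatrix} \tilde\Pi_1 \\ \tilde\Pi_2 \end{pmatrix} = \begin{pmatrix} (B_1 - I_p)_{\cdot n} & B_{2 \cdot n} & \cdots & B_{m-1 \cdot n} & B_{m \cdot n} \\ (B_1 - I_p)_{\cdot k} & B_{2 \cdot k} & \cdots & B_{m-1 \cdot k} & B_{m \cdot k} \\ I_p & -I_p & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & I_p & -I_p \end{pmatrix}, \tag{3.10} \]

where $\tilde\Pi_1$ has $n$ rows and $(\;)_{\cdot n}$ denotes the first $n$ rows of $(\;)$ and $(\;)_{\cdot k}$ denotes the
last $k$ rows of $(\;)$. The $pm \times K$ matrix $\tilde\Omega_2 = \tilde\Pi_2'$ satisfies

\[ \tilde\Omega_2' \tilde\Pi = \Upsilon_{22} \tilde\Omega_2', \tag{3.11} \]

$\Upsilon_{22}$ is nonsingular, and $\tilde\Omega = (\tilde\Omega_1, \tilde\Omega_2)$ is nonsingular. Define $X_t = \tilde\Omega' \tilde Y_t$ and
$W_t = \tilde\Omega' \tilde Z_t$. Then $X_t = (X_{1t}', X_{2t}')'$ satisfies

\[ X_{1t} = X_{1,t-1} + W_{1t}, \tag{3.12} \]

\[ X_{2t} = \Phi_{22} X_{2,t-1} + W_{2t}, \tag{3.13} \]

where the eigenvalues of $\Phi_{22}$ are $\lambda_i$, $i = n+1, \dots, mp$,

\[ X_{2t} = \tilde\Omega_2' \tilde Y_t = \begin{pmatrix} (B_1 - I_p)_{\cdot k} Y_t + B_{2 \cdot k} Y_{t-1} + \cdots + B_{m \cdot k} Y_{t-m+1} \\ Y_t - Y_{t-1} \\ \vdots \\ Y_{t-m+2} - Y_{t-m+1} \end{pmatrix} \tag{3.14} \]

and

\[ W_{2t} = \tilde\Omega_2' \tilde Z_t = \begin{pmatrix} (B_1 - I_p)_{\cdot k} Z_t \\ Z_t \\ 0 \\ \vdots \\ 0 \end{pmatrix}. \tag{3.15} \]

Thus $\{X_{1t}\}$ is an $I(1)$ process of dimension $n$ and $\{X_{2t}\}$ is an $I(0)$ process of
dimension $K$.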

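Now we want to transform $\{X_{2t}\}$ so that $k = p - n$ coordinates constitute the
cointegrated part of $\{Y_t\}$ and the other coordinates are components of $\Delta Y_t, \dots,
\Delta Y_{t-m+1}$. In terms of $Y_t$ (3.12) can be written

\[ \Omega_{11}' \Bigl( \Delta Y_t - \sum_{j=2}^{m} \Pi_{j-1} \Delta Y_{t-j+1} \Bigr) = \Omega_{11}' Z_t = W_{1t}. \tag{3.16} \]

Let

\[ M = \begin{pmatrix} I_k & -\Pi_{1 \cdot k} & \cdots & -\Pi_{m-1 \cdot k} \\ 0 & I_p & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & I_p \end{pmatrix}, \tag{3.17} \]

\[ V_{2t} = M X_{2t} = \begin{pmatrix} \Pi_{\cdot k} Y_t \\ \Delta Y_t \\ \vdots \\ \Delta Y_{t-m+2} \end{pmatrix}, \qquad U_{2t} = M W_{2t} = \begin{pmatrix} \Pi_{\cdot k} Z_t \\ Z_t \\ 0 \\ \vdots \\ 0 \end{pmatrix}. \tag{3.18} \]

Here $\Pi_{\cdot k}$ denotes the last $k$ rows of $\Pi$ defined by (1.5); that is, $\Pi_{\cdot k} = \Pi_2$ in (2.6).
Let $\Theta = M \Phi_{22} M^{-1}$. Then $V_{2t}$ satisfies

\[ V_{2t} = \Theta V_{2,t-1} + U_{2t}. \tag{3.19} \]

The eigenvalues of $\Theta$ are $\lambda_i$, $i = n+1, \dots, mp$. Hence (3.19) defines a stationary
process. In fact

\[ V_{2t} = \sum_{s=0}^{\infty} \Theta^s U_{2,t-s}. \tag{3.20} \]

Since the last $m - 2$ blocks of $U_{2t}$ are 0's, the last $m - 2$ blocks of (3.19) are
identities. The first $k + p$ rows of (3.19) define a stationary process for $\Pi_{\cdot k} Y_t$
and $\Delta Y_t$. □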

Discussion. The process $\{Y_t\}$ is cointegrated of rank $k$, and $\Pi_{\cdot k}$ is the
cointegrating matrix.

The orthogonality conditions of $A_\perp$ and $\Gamma_\perp$ are equivalent to $A_\perp' \Pi = 0$ and
$\Pi \Gamma_\perp = 0$. Hence, $A_\perp$ consists of $p - k$ left-sided characteristic vectors of $\Pi$
corresponding to the characteristic root of 0 and $\Gamma_\perp$ consists of $p - k$ right-sided
characteristic vectors corresponding to the root of 0. The matrix $\Gamma$ corresponds to
$\Pi_2' = \Pi_{\cdot k}'$.

4. Inference 

The model (1.3) has the form of regression

\[ Y_t = A_1 X_{1t} + A_2 X_{2t} + Z_t, \tag{4.1} \]

where $A_1$ is of rank $k$. The maximum likelihood estimator of $A_1$ under normality
of $Z_t$ is the reduced rank regression estimator introduced by Anderson (1951).
Johansen (1988), (1995) also derived the estimator for (1.3) and gives some asymptotic
theory suitable for the cointegrated model. Anderson (2000), (2001), (2002)
has given more details of the asymptotic theory.

Forecasting NBA basketball playoff
outcomes using the weighted likelihood

Feifang Hu and James V. Zidek

University of Virginia and University of British Columbia

Abstract: Predicting the outcome of a future game between two sports teams 
poses a challenging problem of interest to statistical scientists as well as the 
general public. To be effective such prediction must exploit special contextual 
features of the game. In this paper, we confront three such features and address 
the need to: (i) use all relevant sample information; (ii) reflect the home court 
advantage. To do so we use the relevance weighted likelihood of Hu and Zidek 
(2002). Finally we demonstrate the value of the method by showing how it 
could have been used to predict the 1996-1997 NBA Final series results. Our 
relevance likelihood-based method proves to be quite accurate. 


1. Introduction 

This paper demonstrates the use of weighted likelihood (WL) to predict the winner
of the 1996-1997 National Basketball Association (NBA) Finals between the Chicago
Bulls and the Utah Jazz. However, as we try to indicate, the WL has much wider
applicability inside as well as outside the domain of sports.

Statistical methods have been extensively used in sports (Bennett 1998). Harville
(1977) uses regression analysis to rate high school and college football teams based
on observed score differences. In a later paper (Harville 1980), he develops a method
for forecasting the point spread of NFL games by using similar techniques. In related
papers, Schwertman et al. (1996) and Carlin (1996) tackle NCAA basketball. Both
papers (like this one) estimate the probability that team i beats team j. They (unlike
us) are based on pre-game information. The first uses a logistic regression analysis
of win-loss records and various functions of seed numbers (that is, ranks assigned
to the teams going into a tournament), as a way of incorporating prior knowledge
and expert opinion. The second extends earlier unpublished work of Schwertman
et al. (1993) by using other external information such as “...the RPI index,
Sagarin ratings, and so on...” in addition to seed numbers. Like Harville (1980)
and Stern (1992), Carlin uses published point spreads to capture pre-game
information and does a linear regression analysis of observed point spreads on pre-game
information. Models derived from that analysis can be used to predict game
winners.

Our approach, unlike those described above, does not attempt to take pre-game
information into consideration although it may be possible to do that through the
weights in the WL. That issue remains to be explored. Instead, our goal is to
introduce the WL method and show how it can be used. No doubt improvements that
build on earlier work could enhance the method. However, we do assess our
approach against a logistical method that embraces the celebrated method of Bradley
and Terry (1952) that also underlies the work of Schwertman et al. (1996).

The genesis of our work lies in two statistical problems encountered in sports:
(i) the prediction of the outcome of a future game between two specified sports
teams; (ii) the assessment of the accuracy of this prediction. Since typically these
two teams will not have met more than just a few times in the given season, little
direct information will be available to the forecaster. The consequent small sample
size will make naive predictions inaccurate and the associated prediction intervals
excessively large.

Turning to the NBA Finals, we note that the winner is the team that wins a
best of 7 series (that is, the first team to win four games). To predict that outcome,
one might sequentially determine the prediction probability of a Bulls' win in each
of a series of successive games. To find that probability, the 1996-1997 season data
would be used. However, the Bulls met the Jazz just twice, providing the only
“direct” information available, in the terminology of Hu and Zidek (1993) and Hu
(1994). However, that small sample cannot generate accurate predictions.
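To make the step from per-game probabilities to a series prediction concrete, the following sketch computes the probability that one team is first to four wins, given its win probability in each scheduled game. It assumes independent games, and the example schedule (the favored team hosting games 1, 2, 6 and 7) and the two probabilities are hypothetical inputs, not estimates from the paper.

```python
def series_win_probability(game_probs, wins_needed=4):
    """Probability that team A wins the series.

    game_probs[i] is A's win probability in scheduled game i + 1; games are
    assumed independent and the series stops once a team has wins_needed wins.
    """
    # state maps (wins of A, wins of B) -> probability, for undecided series only
    state = {(0, 0): 1.0}
    p_series = 0.0
    for p in game_probs:
        nxt = {}
        for (a, b), prob in state.items():
            for da, db, q in ((1, 0, p), (0, 1, 1.0 - p)):
                na, nb = a + da, b + db
                if na == wins_needed:
                    p_series += prob * q          # A clinches in this game
                elif nb < wins_needed:
                    nxt[(na, nb)] = nxt.get((na, nb), 0.0) + prob * q
        state = nxt
    return p_series


# Hypothetical home and road win probabilities for team A.
p_home, p_road = 0.7, 0.5
schedule = [p_home, p_home, p_road, p_road, p_road, p_home, p_home]
p_a_wins = series_win_probability(schedule)
```

Keeping a separate probability for each game is what allows home and away estimates to enter the series forecast at the right positions in the schedule.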

To overcome this data deficiency, observe that the Bulls (like the Jazz) played
82 games in the season (2 with the Jazz and 80 with other teams). The 160 games
these two played against other teams provide “relevant” information, in the
Hu-Zidek terminology.

To use both the “direct” and “relevant” information in some simple yet flexible
way, Hu (1994) proposes the “relevance weighted likelihood”. Hu and Zidek (2002)
extend that likelihood and Wang (2001) further extends it to get the “weighted
likelihood (WL)”, the terminology we use in this paper.

The method of weighted likelihood has been applied to a neurophysiology
experiment (Hu and Rosenberger, 2000). In that paper, they find that both bias
and mean square error are significantly reduced by using the weighted likelihood
method. Hu and Zidek (2001) use the WL to predict the number of goals (with
prediction intervals) for each of the Vancouver Canucks and Calgary Flames in their
NHL games against each other during the 1996-1997 season. They (Hu and Zidek
2002) show how the WL can be used to construct generalizations of the classical
Shewhart control charts. Their generalization includes the moving average and
exponentially moving average charts and allows for a variety of failure modes when
processes go out of control. This application introduces the weighted likelihood
ratio test. In that same paper, they show how the James-Stein estimator, including
generalizations, can be found with the WL.

A particularly important class of applications arises in estimating parameters
that are interrelated, leading to natural relationships among the associated
populations and inducing transfers of information from their associated samples. Van
Eeden and Zidek (2002) show how such interrelations may be exploited through
the WL when the means of two normal populations with known variances are
ordered. The analogous problem when the mean difference is bounded is treated in
Van Eeden and Zidek (2001). Finally, we would mention an application to disease
mapping in Wang (2001).

In Section 2, we apply the WL in the NBA forecasting application above by
taking advantage of special features of sports data. The maximum WL estimator
(MWLE) is developed for predicting the result of a future game. The mean square
error of this MWLE is given. Moreover, we construct approximate confidence
intervals using the asymptotic theory for the MWLE given by Hu (1997).
In Section 3, we apply the method developed in Section 2 to predict the 1996-
1997 NBA playoff results, specifically for games involving the Chicago Bulls and
the Utah Jazz. Our predictions agree quite well with the actual outcomes.

To validate that positive performance assessment, in Section 3 we consider the 
playoff games played by the Bulls against each of three other teams, the Miami Heat, 
the Atlanta Hawks, and the New York Knicks. Similarly, playoff games between 
the Heat and Knicks are considered. These additional predictions are also in good 
agreement with the actual game outcomes. 

Many other approaches can be taken in our application. In Section 4, our method
is shown to compare favorably with a “purpose built” competitor, an extension of
the Bradley-Terry model (Bradley and Terry 1952). Moreover, it proves to have
all the flexibility and much of the simplicity of its classical predecessor proposed
by Fisher. Thus, we are able to recommend it as a practical alternative to its
competitors for the application considered.

2. Sports data and the WL 

2.1. Contextual features

Usually in sports, the outcome of any one game derives from the combined efforts 
of two teams that have seldom played each other before. Yet these games yield the 
only direct sample information available about the relative strength of these two 
teams. At the same time, each of these teams will have played many games against 
other teams thereby generating relevant (although not direct) sample information. 
The predictive probability of a win in the next game between these two teams, 
should combine both kinds of information. 

In some sports, the home team has a great advantage (see Section 3) that must 
be accounted for when the data are analyzed (although in their application, Hu 
and Zidek (2002) ignored that advantage). Finally, the outcome of any one game 
will depend on both the offensive and defensive capabilities of the teams involved. 
Satisfactory prediction of future games requires that we combine information about 
the offense and defense of the two teams involved in any specific game. 

2.2. The weighted likelihood 

To develop a statistical model for the analysis of sports data, one should recognize
the distinctive contextual features described in the last subsection. Let $Y_{AB}(h)$ be
a Bernoulli random variable that is 1 if team A wins against team B when A is at
home. Similarly, let $Y_{AB}(r)$ be a random variable that is 1 or 0 according as team A
wins or loses against team B when team B is at home. Note that $Y_{AB}(h) = 1 - Y_{BA}(r)$. As
an approximation, we assume in this paper that the time series of $Y$'s for different
games and team pairs are independent. Clearly, a more sophisticated approach like that of
Hu, Rosenberger, and Zidek (2000) would allow dependent game outcomes.

Suppose the $\{Y_{AB}(h)\}$ and $\{Y_{AB}(r)\}$ have probability density functions
$f(y, p_{AB}(h))$ and $f(y, p_{AB}(r))$ respectively. To predict the game result, $(Y_{AB}(h),
Y_{BA}(r))$ or $(Y_{AB}(r), Y_{BA}(h))$, we have to estimate the parameters $p_{AB}(h)$ and
$p_{AB}(r)$.

To create the weights required in implementing the WL, we choose the same
weight in the likelihood factor corresponding to each of the games A played against
teams other than B, irrespective of the opponent. From Hu and Zidek (2002), we
may use the weighted likelihood method to estimate the parameters $p_{AB}(h)$ and
$p_{AB}(r)$. The log weighted likelihood of $p_{AB}(h)$ thus becomes

\[ \sum_{i=1}^{k_{AB}} \log f(y^{i}_{AB}(h), p_{AB}(h)) + \alpha_{AB}(h) \sum_{A(B)} \log f(y_{A(B)}(h), p_{AB}(h)) + \beta_{AB}(h) \sum_{(A)B} \log f(y_{(A)B}(h), p_{AB}(h)), \tag{1} \]

where $k_{AB}$ is the number of games that A played against B at home; $\sum_{A(B)}$ denotes the
sum over all games that A played against teams other than B in the league with
A at home and $y_{A(B)}(h)$ the corresponding binary game outcomes; $\sum_{(A)B}$ denotes the
sum over all games that B played against teams other than A when B is away
and $y_{(A)B}(h)$ the corresponding outcomes. Let $\hat p^{WL}_{AB}(h)$ be the corresponding
maximum weighted likelihood estimate (MWLE) of $p_{AB}(h)$. The MWLE of $p_{AB}(r)$
can be defined in a similar way.

We adopt the approximate Akaike criterion (Akaike, 1977; Akaike, 1985; and
Hu and Zidek, 2002) to select the weights $\alpha_{AB}(h)$ and $\beta_{AB}(h)$ by minimizing with
respect to both,

\[ E\bigl(\hat p^{WL}_{AB}(h) - p_{AB}(h)\bigr)^2. \tag{2} \]

The resulting optima will, however, depend on the unknown $p$'s being estimated. To
address this problem we can use ‘plug-in’ estimators obtained in any reasonable
way, for these $p$'s, to obtain $\alpha_{AB}(h)$ and $\beta_{AB}(h)$ from Equation (2). One possible
way of doing this is demonstrated in Section 3.

In most applications, we need confidence intervals (or the equivalent) for the
parameters. The impossibility of finding exact confidence intervals based on the
MWLE leads us to use approximate ones based on the asymptotic normality of the
MWLE (see Theorem 5 of Hu, 1997). We obtain such a 95% confidence interval for
$p_{AB}(h)$ as

\[ \Bigl[ \hat p^{WL}_{AB}(h) - \mathrm{bias}_{AB} - 1.96\sqrt{\mathrm{var}_{AB}},\; \hat p^{WL}_{AB}(h) + \mathrm{bias}_{AB} + 1.96\sqrt{\mathrm{var}_{AB}} \Bigr]. \tag{3} \]

Here $\mathrm{bias}_{AB}$ and $\mathrm{var}_{AB}$ are the estimators of the bias and variance given in
Theorem 5 of Hu (1997). With those estimates $\hat p_{AB}(h)$ and $\hat p_{BA}(r)$, we can find the
predictive probabilities of winning, losing and drawing the game (along with their
approximate confidence intervals) when a game is played at the home of Team A.

3. Predicting the NBA playoff results 

In this section, we turn to the problem of predicting the outcomes of NBA playoff
games. Our analysis concerns the 1996-1997 season.

The home team advantage is significant in the NBA. We tested the null hypothesis
of no home team advantage against the alternative of a home team advantage and
found a p-value of about 10“^ suggesting the need to separate home and away
games.

To describe our application, let $Y_{AB}(h) \sim \mathrm{Bernoulli}(p_{AB}(h))$ be independently
distributed random variables representing a “win” or “loss” by team A in any one
game played against team B while A is at home. We first estimate the predictive
probabilities $p_{AB}(h)$ and $p_{BA}(h)$ where ‘A’ and ‘B’ denote respectively the Chicago
Bulls and the Utah Jazz, two top NBA teams.

The use of the weighted likelihood seems especially appealing here given the
paucity of “direct” information about the relative strengths of A and B. In fact,
the Jazz played only one game in Chicago. The classical likelihood leaves no chance
of finding reasonable parameter estimates. In contrast, the MWLE brings in
information from games each of these teams played against others in the NBA. That is,
the MWLE uses the information in the “relevant sample” in addition to that in the
“direct sample”.

We find the MWLE of $p_{AB}(h)$ (from the weighted likelihood (1)) to be

$$\hat p^{\mathrm{MWLE}}_{AB}(h) = \bar y_{AB}(h) + \alpha_{AB}(h)\left(\bar y_{A(B)}(h) - \bar y_{AB}(h)\right) + \beta_{AB}(h)\left(\bar y_{(A)B}(h) - \bar y_{AB}(h)\right), \tag{4}$$

where $\bar y_{AB}(h)$ denotes the fraction of wins for A in the games played against
B during the season with A at home. The $\bar y_{A(B)}(h)$ represents the corresponding
fraction of wins for A in the $k_{A(B)}(h)$ games played against teams other than B
with A at home. Similarly, as the worked example below shows, $\bar y_{(A)B}(h)$ is the
fraction of home-team wins in the $k_{(A)B}(h)$ games that B played on the road
against teams other than A.

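By using the approximate Akaike criterion with a reasonable estimate $\hat p_{AB}(h)$
(described below), the optimal weights may be estimated by

$$\alpha_{AB}(h) = \frac{V_{AB}(h)\left[V_{(A)B}(h) + \left(\bar y_{(A)B}(h) - \hat p_{AB}(h)\right)\left(\bar y_{(A)B}(h) - \bar y_{A(B)}(h)\right)\right]}{C + D} \tag{5}$$

and

$$\beta_{AB}(h) = \frac{V_{AB}(h)\left[V_{A(B)}(h) + \left(\bar y_{A(B)}(h) - \hat p_{AB}(h)\right)\left(\bar y_{A(B)}(h) - \bar y_{(A)B}(h)\right)\right]}{C + D}, \tag{6}$$

where

$$V_{AB}(h) = \frac{\hat p_{AB}(h)\left(1 - \hat p_{AB}(h)\right)}{k_{AB}(h)}, \quad
V_{A(B)}(h) = \frac{\bar y_{A(B)}(h)\left(1 - \bar y_{A(B)}(h)\right)}{k_{A(B)}(h)}, \quad
V_{(A)B}(h) = \frac{\bar y_{(A)B}(h)\left(1 - \bar y_{(A)B}(h)\right)}{k_{(A)B}(h)},$$

$$C = V_{AB}(h)\left[V_{(A)B}(h) + V_{A(B)}(h) + \left(\bar y_{(A)B}(h) - \bar y_{A(B)}(h)\right)^2\right]$$

and

$$D = V_{A(B)}(h)\left(\bar y_{(A)B}(h) - \hat p_{AB}(h)\right)^2 + V_{(A)B}(h)\left(\bar y_{A(B)}(h) - \hat p_{AB}(h)\right)^2 + V_{A(B)}(h)V_{(A)B}(h).$$
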

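The corresponding mean square error of the MWLE may be estimated by

$$\begin{aligned}
\mathrm{MSE}_{\mathrm{MWLE}} ={}& \left[\alpha_{AB}(h)\left(\bar y_{A(B)}(h) - \hat p^{\mathrm{MWLE}}_{AB}(h)\right) + \beta_{AB}(h)\left(\bar y_{(A)B}(h) - \hat p^{\mathrm{MWLE}}_{AB}(h)\right)\right]^2 \\
&+ \alpha^2_{AB}(h)\,\frac{\bar y_{A(B)}(h)\left(1 - \bar y_{A(B)}(h)\right)}{k_{A(B)}(h)}
 + \beta^2_{AB}(h)\,\frac{\bar y_{(A)B}(h)\left(1 - \bar y_{(A)B}(h)\right)}{k_{(A)B}(h)} \\
&+ \left(1 - \alpha_{AB}(h) - \beta_{AB}(h)\right)^2\frac{\hat p^{\mathrm{MWLE}}_{AB}(h)\left(1 - \hat p^{\mathrm{MWLE}}_{AB}(h)\right)}{k_{AB}(h)}.
\end{aligned}$$

The 95% confidence interval of $p_{AB}(h)$ based on the MWLE would be

$$\left[\hat p^{\mathrm{MWLE}}_{AB}(h) - \widehat{\mathrm{bias}}_{AB}(h) - 1.96\sqrt{\widehat{\mathrm{var}}_{AB}(h)},\ \ \hat p^{\mathrm{MWLE}}_{AB}(h) + \widehat{\mathrm{bias}}_{AB}(h) + 1.96\sqrt{\widehat{\mathrm{var}}_{AB}(h)}\right],$$

where

$$\widehat{\mathrm{bias}}_{AB}(h) = \left|\alpha_{AB}(h)\left(\bar y_{A(B)}(h) - \hat p^{\mathrm{MWLE}}_{AB}(h)\right) + \beta_{AB}(h)\left(\bar y_{(A)B}(h) - \hat p^{\mathrm{MWLE}}_{AB}(h)\right)\right|$$

and

$$\begin{aligned}
\widehat{\mathrm{var}}_{AB}(h) ={}& \alpha^2_{AB}(h)\,\frac{\bar y_{A(B)}(h)\left(1 - \bar y_{A(B)}(h)\right)}{k_{A(B)}(h)}
 + \beta^2_{AB}(h)\,\frac{\bar y_{(A)B}(h)\left(1 - \bar y_{(A)B}(h)\right)}{k_{(A)B}(h)} \\
&+ \left(1 - \alpha_{AB}(h) - \beta_{AB}(h)\right)^2\frac{\hat p^{\mathrm{MWLE}}_{AB}(h)\left(1 - \hat p^{\mathrm{MWLE}}_{AB}(h)\right)}{k_{AB}(h)}.
\end{aligned}$$
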

We now describe how we found the plug-in estimates, the optimal weights, the
win probabilities and the corresponding confidence intervals by considering the
Bulls against the Jazz while the Bulls are at home.

During the regular season, the Bulls played 41 games at home. One game was
against the Jazz and the Bulls won this game. So $k_{AB} = 1$ and $\bar y_{AB} = 1$. The
Bulls played 40 games against teams other than the Jazz and won 38 of these
games. Thus, $k_{A(B)} = 40$ and $\bar y_{A(B)} = 0.95$. The Jazz played 40 ($k_{(A)B} = 40$)
games against teams other than the Bulls on the road and won 26 of these games, so
$\bar y_{(A)B} = 1 - 26/40 = 0.35$. For this case, the plug-in estimate is

$$\hat p_{AB}(h) = \frac{k_{AB}\,\bar y_{AB} + k_{A(B)}\,\bar y_{A(B)} + k_{(A)B}\,\bar y_{(A)B}}{k_{AB} + k_{A(B)} + k_{(A)B}} = \frac{1 + 38 + 14}{1 + 40 + 40} = \frac{53}{81} = 0.6543.$$



The corresponding values in equations (5) and (6) can be calculated from the
above results. The values are $V_{AB}(h) = 0.2262$, $V_{A(B)}(h) = 0.0011875$,
$V_{(A)B}(h) = 0.0056875$, $C = 0.082987$ and $D = 0.000637$. Substituting these values
into equations (5) and (6), we get the optimal weights:

$$\alpha_{AB}(h) = 0.50925 \quad\text{and}\quad \beta_{AB}(h) = 0.4831.$$
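
The plug-in estimate and the weights in (5) and (6) can be reproduced with a few lines of Python. This is only a sketch of the arithmetic as reconstructed from the formulas above; the function and argument names are ours, not the authors'. Evaluated this way the weights agree with the printed values to about three decimals, while $D$ comes out near 0.000614 rather than the printed 0.000637, presumably a rounding or typesetting difference.

```python
def optimal_weights(k_ab, y_ab, k_a, y_a, k_b, y_b):
    """Plug-in estimate and weights of equations (5) and (6).

    (k_ab, y_ab): games and win fraction for A at home against B
    (k_a, y_a):   A at home against teams other than B
    (k_b, y_b):   home-team results when B visits teams other than A
    """
    # Pooled plug-in estimate of p_AB(h).
    p_plug = (k_ab * y_ab + k_a * y_a + k_b * y_b) / (k_ab + k_a + k_b)

    # Binomial variance terms V_AB, V_A(B) and V_(A)B.
    v_ab = p_plug * (1 - p_plug) / k_ab
    v_a = y_a * (1 - y_a) / k_a
    v_b = y_b * (1 - y_b) / k_b

    c = v_ab * (v_b + v_a + (y_b - y_a) ** 2)
    d = v_a * (y_b - p_plug) ** 2 + v_b * (y_a - p_plug) ** 2 + v_a * v_b

    alpha = v_ab * (v_b + (y_b - p_plug) * (y_b - y_a)) / (c + d)
    beta = v_ab * (v_a + (y_a - p_plug) * (y_a - y_b)) / (c + d)
    return p_plug, alpha, beta


# Bulls (A) at home against the Jazz (B), 1996-1997 regular season.
p_plug, alpha, beta = optimal_weights(1, 1.0, 40, 0.95, 40, 0.35)
print(round(p_plug, 4), round(alpha, 3), round(beta, 3))
```

Because $k_{AB} = 1$, the direct-sample variance $V_{AB}$ dominates, which is why almost all of the weight is moved to the two relevant samples.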


The MWLE in (1) is then

$$\hat p^{\mathrm{MWLE}}_{AB}(h) = 0.66.$$

The corresponding mean square error, bias and variance of this MWLE are

$$\mathrm{MSE}_{\mathrm{MWLE}} = 0.001653, \quad \widehat{\mathrm{bias}}_{AB}(h) = 0.002 \quad\text{and}\quad \widehat{\mathrm{var}}_{AB}(h) = 0.001648.$$

The 95% confidence interval of $p_{AB}(h)$ based on this MWLE is then [0.58, 0.74].
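
As a check on the numbers just quoted, the following sketch evaluates equation (4) together with the bias, variance, mean square error and interval formulas, using the printed weights. The helper name `mwle_summary` is hypothetical. Small differences in the last digit (for instance in the bias) can arise depending on whether the MWLE is rounded to 0.66 before it is substituted.

```python
import math


def mwle_summary(y_ab, k_ab, y_a, k_a, y_b, k_b, alpha, beta):
    """MWLE of equation (4) with its estimated bias, variance, MSE and 95% interval."""
    # Equation (4): shrink the direct estimate toward the two relevant samples.
    p_hat = y_ab + alpha * (y_a - y_ab) + beta * (y_b - y_ab)
    bias = abs(alpha * (y_a - p_hat) + beta * (y_b - p_hat))
    var = (alpha ** 2 * y_a * (1 - y_a) / k_a
           + beta ** 2 * y_b * (1 - y_b) / k_b
           + (1 - alpha - beta) ** 2 * p_hat * (1 - p_hat) / k_ab)
    mse = bias ** 2 + var
    half_width = bias + 1.96 * math.sqrt(var)
    return p_hat, bias, var, mse, (p_hat - half_width, p_hat + half_width)


# Bulls at home against the Jazz, with the weights reported in the text.
p_hat, bias, var, mse, ci = mwle_summary(1.0, 1, 0.95, 40, 0.35, 40, 0.50925, 0.4831)
print(round(p_hat, 2), round(var, 6), [round(x, 2) for x in ci])
```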

The above MWLE is based on the games with all teams that the Bulls played at
home or the Jazz played on the road. Each game has the same weight in the weighted
likelihood. This seems unreasonable because some of the teams are significantly
weaker than others. We therefore restrict attention to the teams (10 teams in the
1996/97 season) that won at least 50 games in the season. Using only the games with
these 10 teams, we calculate the win probabilities as well as the confidence
intervals; the resulting estimate is denoted by MWLE1.

Before the 1996-1997 Finals between the Bulls and the Jazz, both teams had
played the first and second rounds as well as the conference finals. This additional
information is used in constructing MWLE2.

Table 1: The Bulls’ predictive win probabilities (with mean square error) and
confidence intervals based on MWLE, MWLE1 and MWLE2 for a future game between
the Bulls and the Jazz during the 1996-1997 season.

                 MWLE            MWLE1           MWLE2
    At Chicago   0.66 (0.002)    0.77 (0.007)    0.75 (0.004)
    95% C.I.     (0.58, 0.74)    (0.60, 0.94)    (0.62, 0.89)
    At Utah      0.40 (0.002)    0.36 (0.008)    0.34 (0.004)
    95% C.I.     (0.32, 0.48)    (0.16, 0.55)    (0.21, 0.47)


Table 2: The predictive probabilities of a Bulls’ win against the Jazz together with
confidence intervals for MWLE, MWLE1 and MWLE2 in the 1996-1997 Final.

    Method   Outcome       Game 4   Game 5   Game 6   Game 7   Total   90+% C.I.
    MWLE     Bulls’ Win    0.07     0.11     0.21     0.21     0.61    [0.43, 0.77]
             Jazz Win      0.04     0.13     0.11     0.11     0.39    [0.23, 0.56]
    MWLE1    Bulls’ Win    0.07     0.11     0.27     0.26     0.71    [0.30, 0.95]
             Jazz Win      0.02     0.11     0.08     0.08     0.29    [0.05, 0.70]
    MWLE2    Bulls’ Win    0.07     0.10     0.26     0.26     0.69    [0.37, 0.92]
             Jazz Win      0.02     0.12     0.09     0.08     0.31    [0.08, 0.63]



We now use MWLE, MWLE1 and MWLE2 to predict the 1996-1997 Finals
between the Bulls and Jazz. We report the point estimates of the probabilities, the
mean square errors and the confidence intervals of $p_{AB}(h)$ in Table 1.

Based on the probabilities and the confidence intervals of Table 1, we can find
the probabilities with which the Bulls (and the Jazz) will win the Finals in four, five,
six and seven games. We can also calculate the total win probabilities for the Bulls
against the Jazz based on their home and away win probabilities given by each of
the three estimation methods. Confidence intervals for these win probabilities may
be obtained as well. In Table 2, where the results are reported, and in the tables that
follow, that interval is obtained for any pair of teams, say A and B, from the 95%
asymptotic intervals for A’s home- and A’s away-win-against-B probabilities. Since
those intervals are stochastically dependent, we use a Bonferroni argument and
obtain an asymptotic interval of confidence at least 90%. In obtaining that interval,
we rely on the heuristically obvious fact that the overall win probability must be a
monotonically increasing function of the home and away win probabilities.
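
The step from the game-level probabilities of Table 1 to the series-level probabilities of Table 2 can be sketched as follows. The code assumes independent games and a 2-3-2 home schedule (team A hosting games 1, 2, 6 and 7); the schedule is our assumption, since the text does not state it, but under it the MWLE column of Table 1 reproduces the MWLE rows of Table 2 to two decimals. Function names are illustrative only.

```python
# Assumed 2-3-2 schedule: True where team A hosts (games 1, 2, 6 and 7).
A_HOSTS = (True, True, False, False, False, True, True)


def series_distribution(p_home, p_away, a_hosts=A_HOSTS):
    """Return ({g: P(A wins in g games)}, {g: P(B wins in g games)}), best of seven."""
    a_in = {g: 0.0 for g in (4, 5, 6, 7)}
    b_in = {g: 0.0 for g in (4, 5, 6, 7)}

    def play(games_played, a_won, b_won, prob):
        if a_won == 4:
            a_in[games_played] += prob
        elif b_won == 4:
            b_in[games_played] += prob
        else:
            # The chance that A wins the next game depends on the venue.
            p = p_home if a_hosts[games_played] else p_away
            play(games_played + 1, a_won + 1, b_won, prob * p)
            play(games_played + 1, a_won, b_won + 1, prob * (1.0 - p))

    play(0, 0, 0, 1.0)
    return a_in, b_in


def total_win_probability(p_home, p_away):
    """Overall probability that A wins the series."""
    return sum(series_distribution(p_home, p_away)[0].values())


# MWLE column of Table 1: the Bulls win with 0.66 at Chicago and 0.40 at Utah.
bulls_in, jazz_in = series_distribution(0.66, 0.40)
print({g: round(p, 2) for g, p in bulls_in.items()})
print(round(sum(bulls_in.values()), 2))

# Bonferroni-style bounds: plug in the lower and upper 95% limits of Table 1,
# relying on monotonicity of the total in both arguments.
lower = total_win_probability(0.58, 0.32)
upper = total_win_probability(0.74, 0.48)
```

Evaluating the total at the two pairs of interval endpoints is one way to read the Bonferroni argument described above; the paper does not spell out the computation, so this part is a guess at the mechanics.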

Table 1 indicates general agreement between MWLE1 and MWLE2. But MWLE
gives a much smaller estimate of the probability of a Chicago win at home. Both
MWLE1 and MWLE2 predict that the Bulls would win the Finals with high probability.
MWLE1 and MWLE2 also predict that the Bulls will win at Game 6. These predictions
agree with the actual result: the Bulls won the Finals in six games.

To explore the performance of our method further, we have also calculated
prediction probabilities for other pairs of teams: the Bulls vs. the Miami Heat, the
Atlanta Hawks and the New York Knicks, as well as the Miami Heat against the Knicks.
The detailed results are not reported in this paper.

For the Bulls against the Miami Heat, both MWLE1 and MWLE2 also predict
that most probably the Bulls will win at Game 5. That prediction proved to be
correct in the playoffs. When the Bulls play the Atlanta Hawks, MWLE1 and MWLE2
also predict a Bulls’ win at Game 5 with the highest probabilities (0.43 and 0.40).
[In the playoffs the Bulls did win at Game 5.]

Our analysis shows that a Heat-Knicks series will be close. MWLE and MWLE1
predict that the Heat have a slight advantage in the playoffs, while MWLE2 favors
the Knicks slightly. In fact, the Heat won at Game 7. However, an incident occurred
in that series leading to a suspension of several New York players in Games 6 and 7.
Undoubtedly this influenced the outcome.

Overall, MWLE is more conservative in that its predictions are closer to 0.5
than those of the other methods. This is because MWLE uses some not-so-relevant
information from games involving weak teams. When the Bulls and the Jazz play weak
teams, each wins. Thus, these data will tend to increase both of their success rates.
However, since they both enjoy that benefit, the relevant difference in their
estimated strengths will diminish, making the MWLE tend toward 0.5. MWLE1 and
MWLE2 agree with each other, the latter giving slightly more precise predictions
(as measured by the length of the associated predictive intervals in Table 1) because
it incorporates the playoff games.

The Bulls and the Knicks did not meet in the playoffs. However, MWLE1 and
MWLE2 predict a hypothetical Bulls’ win with probabilities 0.75 and 0.78 had they
met. Both predict a hypothetical Bulls’ win for the series at Game 5.

4. Concluding remarks 

The method in this paper provides guidelines for the development of a prediction
strategy. Its implementation, more specifically the construction of weights, entails
the incorporation of any special features that may obtain when the game is played.
For example, one might need to incorporate the knowledge that certain key players
cannot play in that game. [This last consideration did arise in the playoff between
the Miami Heat and the New York Knicks.]

The need for the incorporation of such features was reaffirmed by an unpublished
analysis carried out in the summer of 1998 by Farouk Nathoo. In that analysis, he
twice simulated the entire 1997/1998 season based on the previous year’s results.
In his report he compared the simulation results with the actual results. Among
other things he found the fraction of wins for each of the 29 NBA teams; as an
example, we give the results for the Atlantic Division in Table 3.

We see in this example that the simulated winning percentages are in reasonable
agreement with the actual results except in the case of the Nets, the Knicks and
the Celtics. Given the severity of the challenge of predicting the outcomes of all
games over an entire year, we find our results encouraging.

The WL method can be applied in other sports such as baseball, hockey (see
Hu and Zidek 2002) and soccer. In this paper, we chose the same weight for all teams.
This seems unreasonable in some cases and there we may be able to use the rank
of the teams to get better weights. This is another topic for the future.

Table 3: The percentage of wins in the actual and two simulated 1997/1998 seasons
for the NBA’s Atlantic Division based on the WL win probability estimators
obtained at the end of the previous season.

    Team       Win %: Actual   Win %: Simulation 1   Win %: Simulation 2
    Heat       67              66                    59
    Nets       52              35                    38
    Knicks     52              65                    66
    Wizards    51              52                    54
    Magic      50              54                    48
    Celtics    44              20                    26
    Sixers     38              27                    35

Finally, we would note the abundance of alternative approaches, Bayesian
(Berger, 1985) and non-Bayesian, that could be used in this context. Some specific
methods were described in the Introduction. We intend to compare our approach
with some of these in future work. Here, we restricted our comparisons to an
extension of one of the non-Bayesian approaches, based on that of Bradley and
Terry (1952), to estimate the probabilities of a Bulls’ win for both home and away
games against the Jazz. (We found the corresponding probabilities for the remaining
teams as well but do not report them here.) With these probabilities we could
then compute the termination probabilities analogous to those in Table 2.

To be more precise, we fitted a logistic model using the software R with the
response variable being 1 or 0 according as the outcome of any game during the
season was a visitor or home victory. We used dummy variables to represent visitor
and home teams in each game throughout the season. Thus, for example, Bulls = 1
and Supersonics = 1, all other dummies being 0, would mean those two teams
were playing in that particular game, the visitors being the Bulls. For each of
the factors, “visitor” and “home”, represented by the dummies in this way, we
arbitrarily chose the 76ers as the baseline team. Thus, in effect, the fitted intercept,
suitably transformed, provides an estimate of the likelihood of a “1” in the purely
hypothetical situation where the 76ers played themselves at home as the visitors.
The coefficients for the remaining dummies represent the deviations from the 76ers’
performance for each of the other teams depending on whether they were playing
at home or away.
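
To make the dummy coding concrete, the sketch below shows how a fitted model of this form would turn an intercept and two team coefficients into a win probability through the inverse logit. The authors fitted their model in R; this Python fragment is only illustrative, and every coefficient value in it is hypothetical rather than an estimate from their data.

```python
import math

BASELINE = "76ers"  # team without dummies, as in the text


def visitor_win_probability(visitor, home, intercept, visitor_coef, home_coef):
    """P(response = 1), i.e. a visitor victory, under the dummy-coded logistic model."""
    # The baseline team has no dummy, so its coefficient is implicitly zero.
    eta = intercept + visitor_coef.get(visitor, 0.0) + home_coef.get(home, 0.0)
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients, for illustration only (not fitted to any data).
visitor_coef = {"Bulls": 1.2, "Jazz": 0.9}
home_coef = {"Bulls": -1.5, "Jazz": -1.3}

p_jazz_wins_at_chicago = visitor_win_probability("Jazz", "Bulls", -0.4, visitor_coef, home_coef)
p_bulls_win_at_home = 1.0 - p_jazz_wins_at_chicago
print(round(p_bulls_win_at_home, 3))
```

With the intercept alone (baseline visiting baseline) the function returns the transformed intercept, which is the purely hypothetical quantity mentioned above.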

The results differed somewhat from those obtained by the MWLE2 WL method.
To be specific, we found the probability of a Bulls’ win at home to be 0.76 as
compared with the 0.75 seen in Table 1, while the corresponding probabilities for
the Jazz were 0.71 and 0.66 respectively. These differences became more pronounced
when we computed the probabilities corresponding to Table 2. We see a comparison
of the results in Table 4.

In Table 4, we see that the Bradley-Terry extension points to a Bulls’ victory
in Game 7 while the MWLE2 is ambivalent between Games 6 and 7. Obviously
a more extensive comparison would be needed to assess the relative performance
of the methods. But considering the large number of parameters needed by the
logistic model, these very preliminary results make the weighted likelihood model
more desirable for forecasting the outcomes of NBA playoff games.

Table 4: The predictive probabilities of a Bulls’ win against the Jazz for both the
MWLE2 and Bradley-Terry (logistic) based methods in the 1996-1997 Final.

    Method          Outcome       Game 4   Game 5   Game 6   Game 7
    MWLE2           Bulls’ Win    0.07     0.10     0.26     0.26
                    Jazz Win      0.02     0.12     0.09     0.08
    Bradley-Terry   Bulls’ Win    0.05     0.08     0.25     0.28
                    Jazz Win      0.03     0.14     0.09     0.09

However, we would not expect our method to do as well as it did above when
competing in particular contexts with purpose-built methods. Instead, we see its
value deriving from its relative ease of use and its broad domain of applicability,
features it shares with the classical likelihood itself. That is, we see it as a valuable
tool in the statistical toolbox. In this paper, we have tried to demonstrate its value
from that perspective.

In particular, although in this manuscript we have used only binary outcome
information about team wins or losses, the theory can be extended to incorporate
more complex outcome information such as the scores, for example. In that case,
we could have defined the $Y_{AB}(h)$ to be the score of team A against team B when
team A is at home and so on.

The referee pointed to another direction for future work when he or she noticed
that “the weighted likelihood method to estimate the probability of A beating B
(at home) uses information about B at A, C at A, and B at C. It seems logical to
use information concerning A at B.” We agree. However, we have not been able to
do that yet since we do not know how to relate $p_{AB}(h)$ and $p_{AB}(r)$ through the
WLE.

with non-homogeneous compound Poisson
damage processes

Abstract: Failure time distributions are derived for non-homogeneous compound
Poisson cumulative damage processes. We focus attention on Weibull
type processes with exponential damage size. The hazard functions are
illustrated and their asymptotic behavior investigated. Moment equations and
maximum likelihood estimates are studied for the homogeneous case.

1. Introduction 

Bogdanoff and Kozin, in their book (1985), define cumulative damage (CD) as the
“irreversible accumulation of damage throughout life, that ultimately leads to
failure”. Such damage can be manifested by corrosion, cracks, physical wear in bearings,
piston rings, locks, etc. We focus attention on damage processes that occur at
random times, according to some non-homogeneous Poisson process. The amount of
damage that accumulates follows a specified distribution. Thus, the amount of
damage at time $t$ is a realization of a random process $\{Y(t), t \ge 0\}$, where $Y(t) \ge 0$ is
a non-decreasing process with $Y(t) \to \infty$ a.s. as $t \to \infty$.

A system subjected to such a damage process fails at the first instant at which
$Y(t) \ge \beta$, where $0 < \beta < \infty$ is a threshold specific to the system. Thus, the
distribution of the failure times is a stopping time distribution. We present in this
paper the methodology of deriving these distributions. We are interested
in particular in a family of non-homogeneous Poisson processes having an intensity
function of the Weibull type, with cumulative intensity $m(t) = (\lambda t)^{\nu}$, $0 < \lambda, \nu < \infty$.
In Section 2 we specify compound non-homogeneous Poisson damage processes, and the
distribution of the cumulative damage $Y(t)$ at time $t$. In Section 3 we derive the density
and the reliability function of failure times driven by such processes. In particular
we focus attention on cumulative Weibull processes with exponentially distributed
damage amount in each occurrence. We investigate and illustrate the behavior of
the distribution of failure times and the hazard function. In Section 4 we develop
estimators of the parameters of the failure distribution in the homogeneous case
($\nu = 1$).

An extensive list of publications on damage processes is given in Bogdanoff
and Kozin (1985). They provide empirical examples, and mention (p. 28) the
non-homogeneous Poisson process with a Weibull intensity function. The theory for
a discrete Markov chain model, having $b$ states of damage, is developed in this
book. A recent paper on the subject is that of W. Kahle and H. Wendt (2000).
They have modeled damage by a marked point process, and focus attention on
doubly stochastic compound Poisson processes. Their formulation is close to ours,
but they do not provide an explicit formula for the distribution of failure times.
Other related papers are those concerned with shock models, like Esary, Marshall
and Proschan (1973), Feng, Adachi and Kowada (1994), Shaked (1983), Sobczyk
(1987).

Keywords and phrases: cumulative damage processes, non-homogeneous compound Poisson
processes, distributions of stopping times, reliability functions, hazard functions, moment equation
estimators, maximum likelihood estimators.


2. Compound cumulative damage processes 


We consider cumulative damage processes (CDP) modeled by non-homogeneous
compound Poisson processes. In this model, the system is subjected to shocks at
random times, $0 < \tau_1 < \tau_2 < \cdots$, following a non-homogeneous Poisson process
with an intensity function $\lambda(t)$ (see Kao, 1997, p. 56). The amount of damage to
the system at the $n$-th shock is a random variable $X_n$, $n \ge 1$. We assume that
$X_0 \equiv 0$, that $X_1, X_2, \ldots$ are i.i.d., and that the sequence $\{X_n, n \ge 1\}$ is independent of
$\{\tau_n, n \ge 1\}$.

Let $\{N(t), t \ge 0\}$ be a non-homogeneous Poisson counting process, with $N(0) = 0$,
where

$$N(t) = \max\{n : \tau_n \le t\}. \tag{1}$$

$\{N(t), t \ge 0\}$ is a process of independent increments such that, for any $0 \le s < t < \infty$,

$$P\{N(t) - N(s) = n\} = e^{-(m(t) - m(s))}\,\frac{(m(t) - m(s))^n}{n!}, \tag{2}$$

$n = 0, 1, \ldots$, where $m(t) = \int_0^t \lambda(s)\,ds$, $0 \le t < \infty$. The compound damage process
(CDP) $\{Y(t), t \ge 0\}$ is defined as

$$Y(t) = \sum_{n=0}^{N(t)} X_n. \tag{3}$$

It is a compound non-homogeneous Poisson process. The compound Poisson process
(CPP) is the special case of a constant intensity function, $\lambda(t) = \lambda$ for all
$0 < t < \infty$, $0 < \lambda < \infty$. We restrict attention in the present paper to the family of
compound Weibull processes (CWP), in which $\lambda(t) = \lambda\nu(\lambda t)^{\nu - 1}$, $0 < t < \infty$, for
$0 < \lambda, \nu < \infty$. Furthermore, we assume that $X_n$, $n \ge 1$, are absolutely continuous
random variables, having a common distribution function $F$ and density $f$.

The cdf of $Y(t)$, at $t > 0$, has a discontinuity at $y = 0$, and is absolutely
continuous on $0 < y < \infty$. It is given by

$$D(y; t) = \sum_{n=0}^{\infty} e^{-m(t)}\,\frac{m(t)^n}{n!}\,F^{(n)}(y), \tag{4}$$

with $D(0; t) = \exp(-m(t))$, where $F^{(0)} \equiv 1$ and $F^{(n)}$ is the $n$-fold convolution of $F$, i.e.,

$$F^{(n)}(y) = \begin{cases} F(y), & \text{if } n = 1 \\[2pt] \displaystyle\int_0^y f(x)\,F^{(n-1)}(y - x)\,dx, & \text{if } n \ge 2. \end{cases} \tag{5}$$

The defective density of $Y(t)$ on $(0, \infty)$ is

$$d(y; t) = \sum_{n=1}^{\infty} e^{-m(t)}\,\frac{m(t)^n}{n!}\,f^{(n)}(y), \tag{6}$$

where $f^{(n)}$ is the $n$-fold convolution of the density $f$. We will use the notation
$p(n; \mu)$ and $P(n; \mu)$ for the probability function and cdf, respectively, of the Poisson
distribution with mean $\mu$. Accordingly, the density of the CWP, at $0 < y < \infty$ and
$0 < t < \infty$, is

$$d(y; t, \lambda, \nu) = \sum_{n=1}^{\infty} p(n; (\lambda t)^{\nu})\,f^{(n)}(y), \tag{7}$$

and its cdf is

$$D(y; t, \lambda, \nu) = \sum_{n=0}^{\infty} p(n; (\lambda t)^{\nu})\,F^{(n)}(y). \tag{8}$$

We consider a special case of these functions, when the amount of damage $X_n$ is
exponentially distributed with parameter $\mu$, i.e., $E\{X_n\} = 1/\mu$. In this special case
$f^{(n)}(y) = \mu\,p(n - 1; \mu y)$ and $F^{(n)}(y) = 1 - P(n - 1; \mu y)$. The results of this paper
can be generalized to damage processes driven by compound renewal processes with
any distribution $F$.
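
For exponential damage the series (8) is easy to evaluate numerically. The sketch below truncates the sum at a fixed number of terms; the truncation point `n_max`, the function names and the parameter values in the example call are our choices for illustration, not part of the paper.

```python
import math


def pois_pmf(n, mean):
    """Poisson probability p(n; mean), computed on the log scale."""
    if mean == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))


def damage_cdf(y, t, lam, nu, mu, n_max=200):
    """D(y; t, lam, nu) of (8) with F^(n)(y) = 1 - P(n - 1; mu * y) and F^(0) = 1."""
    m = (lam * t) ** nu
    total = pois_pmf(0, m)          # n = 0: no shock yet, so Y(t) = 0
    cdf_damage = 0.0                # running value of P(n - 1; mu * y)
    for n in range(1, n_max + 1):
        cdf_damage += pois_pmf(n - 1, mu * y)
        total += pois_pmf(n, m) * max(0.0, 1.0 - cdf_damage)
    return total


# Example: lam = 1, nu = 1, mean damage 1/mu = 1, time t = 2.
print(damage_cdf(0.0, 2.0, 1.0, 1.0, 1.0), math.exp(-2.0))  # atom at y = 0
print(damage_cdf(3.0, 2.0, 1.0, 1.0, 1.0))
```

The first printed pair illustrates the discontinuity at $y = 0$: the cdf starts at $\exp(-m(t))$, the probability that no shock has occurred by time $t$.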


3. Cumulative damage failure distributions 

A cumulative damage failure time is the stopping time

$$T(\beta) = \inf\{t \ge 0 : Y(t) \ge \beta\}, \tag{9}$$

where $0 < \beta < \infty$. Since $Y(t)$ is non-decreasing a.s., we immediately obtain that,
in the continuous case,

$$P\{T(\beta) > t\} = D(\beta; t), \quad 0 < t < \infty. \tag{10}$$

This is the reliability (survival) function of the system. Thus, for the CWP, with
general damage distribution,

$$P\{T(\beta) > t\} = \sum_{n=0}^{\infty} p(n; (\lambda t)^{\nu})\,F^{(n)}(\beta). \tag{11}$$

In the special case of exponential damage distribution,

$$P\{T(\beta) > t\} = 1 - \sum_{n=1}^{\infty} p(n; (\lambda t)^{\nu})\,P(n - 1; \mu\beta). \tag{12}$$

We see in (12) that, in the exponential case, the distribution of $T(\beta)$ depends on
$\mu$ and $\beta$ only through $\zeta = \mu\beta = \beta/E\{X_1\}$. Accordingly, let $R(t; \lambda, \nu, \zeta)$ denote the
reliability function of a system under CWP with exponential damage distribution
(CWP/E).

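Theorem 1. Under CWP/E the reliability function is

$$R(t; \lambda, \nu, \zeta) = \sum_{j=0}^{\infty} p(j; \zeta)\,P(j; (\lambda t)^{\nu}). \tag{13}$$

Proof. According to (12),

$$\begin{aligned}
R(t; \lambda, \nu, \zeta) &= 1 - \sum_{n=1}^{\infty} p(n; (\lambda t)^{\nu}) \sum_{j=0}^{n-1} p(j; \zeta) \\
&= 1 - \sum_{j=0}^{\infty} p(j; \zeta) \sum_{n=j+1}^{\infty} p(n; (\lambda t)^{\nu}) \\
&= 1 - \sum_{j=0}^{\infty} p(j; \zeta)\left(1 - P(j; (\lambda t)^{\nu})\right).
\end{aligned}$$

This implies (13). □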
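
The interchange of summations in the proof can be checked numerically. The sketch below evaluates the reliability function both from (13) and from (12), with the infinite sums truncated; function names, the truncation point and the parameter values are illustrative assumptions.

```python
import math


def pois_pmf(n, mean):
    """Poisson probability p(n; mean), computed on the log scale."""
    if mean == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))


def reliability(t, lam, nu, zeta, j_max=200):
    """R(t; lam, nu, zeta) from equation (13)."""
    u = (lam * t) ** nu
    total, cdf_u = 0.0, 0.0
    for j in range(j_max + 1):
        cdf_u += pois_pmf(j, u)              # P(j; u)
        total += pois_pmf(j, zeta) * cdf_u
    return total


def reliability_from_12(t, lam, nu, zeta, n_max=200):
    """The same quantity from equation (12), with zeta = mu * beta."""
    u = (lam * t) ** nu
    total, cdf_zeta = 0.0, 0.0
    for n in range(1, n_max + 1):
        cdf_zeta += pois_pmf(n - 1, zeta)    # P(n - 1; zeta)
        total += pois_pmf(n, u) * cdf_zeta
    return 1.0 - total


for t in (0.5, 3.0, 8.0):
    print(t, reliability(t, 1.0, 1.1, 5.0), reliability_from_12(t, 1.0, 1.1, 5.0))
```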

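It is obvious from (9) that $P\{T(\beta) < \infty\} = 1$ for any $0 < \beta < \infty$. This follows
also from the following theorem.

Theorem 2. Under CWP/E, $R(0; \lambda, \nu, \zeta) = 1$, $R(t; \lambda, \nu, \zeta)$ is strictly decreasing in
$t$ for $(\lambda, \nu, \zeta)$ fixed, and $\lim_{t \to \infty} R(t; \lambda, \nu, \zeta) = 0$ for any $0 < \lambda, \nu, \zeta < \infty$.

Proof. According to (13), since $\lim_{t \to 0} P(j; (\lambda t)^{\nu}) = 1$ for all $j = 0, 1, \ldots$ and any
$0 < \lambda, \nu < \infty$, the bounded convergence theorem implies that

$$\lim_{t \to 0} R(t; \lambda, \nu, \zeta) = \sum_{j=0}^{\infty} p(j; \zeta) \lim_{t \to 0} P(j; (\lambda t)^{\nu}) = 1.$$

Furthermore, the Poisson family is an MLR family and $P(j; (\lambda t)^{\nu}) \downarrow t$. Hence,
$R(t; \lambda, \nu, \zeta) \downarrow t$, i.e., $\frac{\partial}{\partial t} R(t; \lambda, \nu, \zeta) < 0$, for any fixed $(\lambda, \nu, \zeta)$, $0 < \lambda, \nu, \zeta < \infty$.
Finally, since $\lim_{t \to \infty} P(j; (\lambda t)^{\nu}) = 0$ for any fixed $j \ge 0$, $0 < \lambda, \nu < \infty$, the dominated
convergence theorem implies that $\lim_{t \to \infty} R(t; \lambda, \nu, \zeta) = 0$ for any $0 < \lambda, \nu, \zeta < \infty$.

□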
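
A small Monte Carlo experiment gives an independent check of the reliability function. The sketch below simulates the failure time directly from the definition: shocks arrive on a unit-rate Poisson clock that is mapped back through $m(t) = (\lambda t)^{\nu}$, and exponential damages are accumulated until the threshold $\beta$ is reached. The seed, sample size and parameter values are arbitrary choices of ours.

```python
import math
import random


def pois_pmf(n, mean):
    """Poisson probability p(n; mean), computed on the log scale."""
    if mean == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))


def reliability(t, lam, nu, zeta, j_max=200):
    """R(t; lam, nu, zeta) from equation (13)."""
    u = (lam * t) ** nu
    total, cdf_u = 0.0, 0.0
    for j in range(j_max + 1):
        cdf_u += pois_pmf(j, u)
        total += pois_pmf(j, zeta) * cdf_u
    return total


def simulate_failure_time(lam, nu, mu, beta, rng):
    """One draw of T(beta) for a compound Weibull process with Exp(mu) damage."""
    unit_clock = 0.0   # arrival time on the unit-rate Poisson clock
    damage = 0.0
    while damage < beta:
        unit_clock += rng.expovariate(1.0)   # next shock on the transformed clock
        damage += rng.expovariate(mu)        # damage of that shock, mean 1/mu
    # Undo the time change m(t) = (lam * t) ** nu.
    return unit_clock ** (1.0 / nu) / lam


rng = random.Random(2024)
lam, nu, mu, beta, t = 1.0, 1.5, 1.0, 5.0, 3.0
draws = [simulate_failure_time(lam, nu, mu, beta, rng) for _ in range(20000)]
empirical = sum(d > t for d in draws) / len(draws)
exact = reliability(t, lam, nu, mu * beta)
print(round(empirical, 3), round(exact, 3))
```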


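Theorem 3. Under CWP/E, the density of $T(\zeta)$, $0 < \zeta < \infty$, is

$$f(t; \lambda, \nu, \zeta) = \lambda\nu(\lambda t)^{\nu - 1} \sum_{j=0}^{\infty} p(j; \zeta)\,p(j; (\lambda t)^{\nu}), \tag{14}$$

and its $m$-th moment, $m \ge 1$, is

$$E\{(T(\zeta))^m\} = \frac{1}{\lambda^m} \sum_{j=0}^{\infty} p(j; \zeta)\,\frac{\Gamma\left(j + 1 + \frac{m}{\nu}\right)}{\Gamma(j + 1)}. \tag{15}$$

Proof. It is easy to verify that

$$\frac{\partial}{\partial\omega} P(j; \omega) = -p(j; \omega), \quad 0 < \omega < \infty.$$

Moreover,

$$f(t; \lambda, \nu, \zeta) = -\frac{\partial}{\partial t} P\{T(\zeta) > t\} = -\frac{\partial}{\partial t} \sum_{j=0}^{\infty} p(j; \zeta)\,P(j; (\lambda t)^{\nu}).$$

This implies (14), since $R(t; \lambda, \nu, \zeta)$ is an analytic function of $t$, or by bounded
convergence. To prove (15) we write

$$\begin{aligned}
E\{(T(\zeta))^m\} &= \int_0^{\infty} t^m f(t; \lambda, \nu, \zeta)\,dt \\
&= \nu\lambda \sum_{j=0}^{\infty} p(j; \zeta) \int_0^{\infty} t^m (\lambda t)^{\nu - 1}\,p(j; (\lambda t)^{\nu})\,dt \\
&= \frac{1}{\lambda^m} \sum_{j=0}^{\infty} p(j; \zeta)\,\frac{1}{j!} \int_0^{\infty} u^{j + m/\nu} e^{-u}\,du \\
&= \frac{1}{\lambda^m} \sum_{j=0}^{\infty} p(j; \zeta)\,\frac{\Gamma\left(j + 1 + \frac{m}{\nu}\right)}{\Gamma(j + 1)}.
\end{aligned}$$

□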


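Corollary. In the homogeneous case ($\nu = 1$) with exponential damage, the expected
value, variance and coefficient of skewness of $T(\zeta)$ are, correspondingly,

$$E\{T(\zeta) \mid \lambda, \nu = 1, \zeta\} = \frac{1 + \zeta}{\lambda}, \tag{16}$$

$$V\{T(\zeta) \mid \lambda, \nu = 1, \zeta\} = \frac{1 + 2\zeta}{\lambda^2}, \tag{17}$$

and

$$\gamma_1(T(\zeta)) = \frac{2(1 + 3\zeta)}{(1 + 2\zeta)^{3/2}}. \tag{18}$$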
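
The moment formula (15) can be evaluated directly and compared with the closed forms of the Corollary. The following sketch truncates the Poisson sum; the function names and the parameter values $\lambda = 2$, $\zeta = 5$ are illustrative choices.

```python
import math


def pois_pmf(n, mean):
    """Poisson probability p(n; mean), computed on the log scale."""
    if mean == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))


def failure_time_moment(m, lam, nu, zeta, j_max=300):
    """m-th moment of T(zeta) from equation (15), truncated at j_max terms."""
    total = 0.0
    for j in range(j_max + 1):
        log_ratio = math.lgamma(j + 1 + m / nu) - math.lgamma(j + 1)
        total += pois_pmf(j, zeta) * math.exp(log_ratio)
    return total / lam ** m


lam, zeta = 2.0, 5.0
m1 = failure_time_moment(1, lam, 1.0, zeta)
m2 = failure_time_moment(2, lam, 1.0, zeta)
m3 = failure_time_moment(3, lam, 1.0, zeta)
variance = m2 - m1 ** 2
skewness = (m3 - 3 * m1 * m2 + 2 * m1 ** 3) / variance ** 1.5

print(m1, (1 + zeta) / lam)                                   # compare with (16)
print(variance, (1 + 2 * zeta) / lam ** 2)                    # compare with (17)
print(skewness, 2 * (1 + 3 * zeta) / (1 + 2 * zeta) ** 1.5)   # compare with (18)
```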


Notice also that equation (15) shows that moments of $T(\zeta)$ of all orders exist,
since moments of all orders of the Poisson distribution exist. In Figure 1 we present
several densities of $T(\zeta)$, for $\lambda = 1$, $\zeta = 5$ and $\nu = 1.1, 1, 0.9$. According to eq. (14),

$$\lim_{t \to 0} f(t; \lambda, \nu, \zeta) = \begin{cases} \infty, & \text{if } \nu < 1 \\ \lambda e^{-\zeta}, & \text{if } \nu = 1 \\ 0, & \text{if } \nu > 1. \end{cases} \tag{19}$$

Indeed, $\lim_{t \to 0} p(j; (\lambda t)^{\nu}) = I\{j = 0\}$, i.e., 1 if $j = 0$ and 0 otherwise. Thus,
$\lim_{t \to 0} \sum_{j=0}^{\infty} p(j; \zeta)\,p(j; (\lambda t)^{\nu}) = p(0; \zeta) = e^{-\zeta}$. The densities $f(t; \lambda, \nu, \zeta)$ are
uni-modal whenever $\nu \ge 1$, and bi-modal when $\nu < 1$. Figure 1 does not show the
behavior of these densities in the interval (0, 1). We see that the density becomes
more symmetric as $\zeta$ grows. Indeed, $\frac{\partial}{\partial\zeta}\gamma_1(T(\zeta)) = -\frac{6\zeta}{(1 + 2\zeta)^{5/2}} < 0$ for all $0 < \zeta < \infty$.

From eq. (13) we obtain immediately that the reliability function $R(t; \lambda, \nu, \zeta)$
is a strictly increasing function of $\zeta$, for each fixed $(t, \lambda, \nu)$. This result is obvious
from (9) if $\mu = 1$. Generally, for fixed $t, \lambda, \nu$, $P(j; (\lambda t)^{\nu})$ is an increasing function
of $j$. Hence, since the Poisson family $\{p(\cdot\,; \zeta), 0 < \zeta < \infty\}$ is a monotone likelihood
ratio (MLR) family, $E_{\zeta}\{P(J; (\lambda t)^{\nu})\}$ is an increasing function of $\zeta$.

The hazard function under CWP/E damage processes is

$$h(t; \lambda, \nu, \zeta) = \frac{\lambda\nu(\lambda t)^{\nu - 1} \sum_{j=0}^{\infty} p(j; \zeta)\,p(j; (\lambda t)^{\nu})}{\sum_{j=0}^{\infty} p(j; \zeta)\,P(j; (\lambda t)^{\nu})}. \tag{20}$$

We obtain from (19), since $\lim_{t \to 0} P(j; (\lambda t)^{\nu}) = 1$ for all $j \ge 0$, that

$$\lim_{t \to 0} h(t; \lambda, \nu, \zeta) = \begin{cases} \infty, & \text{if } 0 < \nu < 1 \\ \lambda e^{-\zeta}, & \text{if } \nu = 1 \\ 0, & \text{if } \nu > 1. \end{cases} \tag{21}$$

In Figure 2 we illustrate the hazard function (20) for $\lambda = 1$, $\zeta = 5$ and
$\nu = 0.53, 0.55, 0.57$.

Figure 1: Densities of $T(\zeta)$, $\lambda = 1$, $\zeta = 5$, $\nu = 1.1$ (solid), $\nu = 1.0$ (dotted), $\nu = 0.9$.

Similar types of hazard functions were discussed by Aalen and Gjessing (2003).
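
The hazard function (20) and its behavior near $t = 0$ can be explored numerically. The sketch below is a direct, truncated evaluation of the two series; the function names and the evaluation points are illustrative choices of ours.

```python
import math


def pois_pmf(n, mean):
    """Poisson probability p(n; mean), computed on the log scale."""
    if mean == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))


def hazard(t, lam, nu, zeta, j_max=200):
    """Hazard function (20) for t > 0, with both series truncated at j_max."""
    u = (lam * t) ** nu
    num = den = cdf_u = 0.0
    for j in range(j_max + 1):
        p_u = pois_pmf(j, u)
        cdf_u += p_u                       # P(j; u)
        weight = pois_pmf(j, zeta)
        num += weight * p_u
        den += weight * cdf_u
    return lam * nu * (lam * t) ** (nu - 1) * num / den


# Behavior near t = 0 for the three regimes of nu, with lam = 1 and zeta = 5.
print(hazard(1e-8, 1.0, 1.0, 5.0), math.exp(-5.0))   # nu = 1: close to lam * exp(-zeta)
print(hazard(1e-8, 1.0, 0.5, 5.0))                   # nu < 1: large
print(hazard(1e-4, 1.0, 2.0, 5.0))                   # nu > 1: close to zero
```
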
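We examine now the asymptotic behavior of the hazard function (20), as
$t \to \infty$. Make first the transformation $u = (\lambda t)^{\nu}$. In terms of $u$, the hazard function
is

$$h^*(u; \lambda, \nu, \zeta) = \lambda\nu\,u^{1 - 1/\nu} \cdot \frac{E_{\zeta}\{p(J; u)\}}{E_{\zeta}\{P(J; u)\}}, \tag{22}$$

where $J \sim \mathrm{Pois}(\zeta)$.

Theorem 4. For fixed $(\lambda, \nu, \zeta)$, the asymptotic behavior of the hazard function is

$$\lim_{u \to \infty} h^*(u; \lambda, \nu, \zeta) = \begin{cases} \infty, & \text{if } \nu > 1 \\ \lambda, & \text{if } \nu = 1 \\ 0, & \text{if } \nu < 1. \end{cases} \tag{23}$$
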

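Proof. Since $p(j; u) \le P(j; u)$ for $j = 0, 1, \ldots$ and each $u$, $0 < u < \infty$,

$$\limsup_{u \to \infty} \frac{E_{\zeta}\{p(J; u)\}}{E_{\zeta}\{P(J; u)\}} \le 1. \tag{24}$$

We now prove that

$$\lim_{u \to \infty} \frac{E_{\zeta}\{p(J; u)\}}{E_{\zeta}\{P(J; u)\}} = 1. \tag{25}$$
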


S. Zacks


Figure 2. Hazard Functions, $\lambda = 1$, $\zeta = 5$, $\nu = 0.57$ (solid), $\nu = 0.55$, $\nu = 0.53$


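First, by dominated convergence, $\lim_{u \to \infty} E_{\zeta}\{p(J; u)\} = E_{\zeta}\{\lim_{u \to \infty} p(J; u)\} = 0$.
Similarly, $\lim_{u \to \infty} E_{\zeta}\{P(J; u)\} = 0$. By L’Hospital’s rule,

$$\begin{aligned}
\lim_{u \to \infty} \frac{E_{\zeta}\{p(J; u)\}}{E_{\zeta}\{P(J; u)\}}
&= \lim_{u \to \infty} \frac{\frac{\partial}{\partial u} E_{\zeta}\{p(J; u)\}}{\frac{\partial}{\partial u} E_{\zeta}\{P(J; u)\}} \\
&= \lim_{u \to \infty} \frac{E_{\zeta}\{p(J; u) - p(J - 1; u)\}}{E_{\zeta}\{p(J; u)\}} \\
&= 1 - \lim_{u \to \infty} \frac{\sum_{n=0}^{\infty} p(n + 1; \zeta)\,p(n; u)}{\sum_{n=0}^{\infty} p(n; \zeta)\,p(n; u)}.
\end{aligned}$$

Furthermore,

$$\frac{\sum_{n=0}^{\infty} p(n + 1; \zeta)\,p(n; u)}{\sum_{n=0}^{\infty} p(n; \zeta)\,p(n; u)}
= \zeta\,\frac{\sum_{n=0}^{\infty} \frac{1}{n + 1}\,p(n; \zeta)\,p(n; u)}{\sum_{n=0}^{\infty} p(n; \zeta)\,p(n; u)}.$$

Fix a positive integer $K$ (arbitrary). Then,

$$R(\zeta, u) = \frac{\sum_{n=0}^{\infty} \frac{1}{n + 1}\,p(n; \zeta)\,p(n; u)}{\sum_{n=0}^{\infty} p(n; \zeta)\,p(n; u)}
\le \frac{\sum_{n=0}^{K} \frac{1}{n + 1}\,p(n; \zeta)\,p(n; u) + \frac{1}{K + 2} \sum_{n=K+1}^{\infty} p(n; \zeta)\,p(n; u)}{\sum_{n=0}^{K} p(n; \zeta)\,p(n; u) + \sum_{n=K+1}^{\infty} p(n; \zeta)\,p(n; u)}.$$

Finally, since $p(n; u) \to 0$ as $u \to \infty$ for each $n = 0, 1, \ldots$,

$$\lim_{u \to \infty} \sum_{j=0}^{K} \frac{1}{j + 1}\,p(j; \zeta)\,p(j; u) = \lim_{u \to \infty} \sum_{j=0}^{K} p(j; \zeta)\,p(j; u) = 0.$$

Thus,

$$\lim_{u \to \infty} R(\zeta; u) \le \frac{1}{K + 2} \lim_{u \to \infty} \frac{\sum_{j=K+1}^{\infty} p(j; \zeta)\,p(j; u)}{\sum_{j=K+1}^{\infty} p(j; \zeta)\,p(j; u)} = \frac{1}{K + 2}, \quad \text{for all fixed } K. \tag{26}$$

Since $K$ is arbitrary, $R(\zeta; u) \to 0$ as $u \to \infty$, which proves (25); (23) then follows
from (22), because $u^{1 - 1/\nu}$ tends to $\infty$, 1 or 0 according as $\nu > 1$, $\nu = 1$ or $\nu < 1$.

□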


In Figure 3 we illustrate a hazard function for $\lambda = 1$, $\zeta = 5$, $\nu = 0.5$.
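
The limit (25) is approached rather slowly, which a short numerical experiment makes visible. The sketch below evaluates the ratio $E_{\zeta}\{p(J; u)\}/E_{\zeta}\{P(J; u)\}$ for a few moderate values of $u$ (larger values would need more careful scaling to avoid floating-point underflow) and then forms $h^*(u)$ as in (22). Names, truncation point and the grid of $u$ values are our own choices.

```python
import math


def pois_pmf(n, mean):
    """Poisson probability p(n; mean), computed on the log scale."""
    if mean == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))


def pmf_cdf_ratio(u, zeta, j_max=400):
    """E{p(J; u)} / E{P(J; u)} for J ~ Pois(zeta), with truncated sums."""
    num = den = cdf_u = 0.0
    for j in range(j_max + 1):
        p_u = pois_pmf(j, u)
        cdf_u += p_u
        weight = pois_pmf(j, zeta)
        num += weight * p_u
        den += weight * cdf_u
    return num / den


def hazard_star(u, lam, nu, zeta):
    """h*(u; lam, nu, zeta) of equation (22)."""
    return lam * nu * u ** (1.0 - 1.0 / nu) * pmf_cdf_ratio(u, zeta)


ratios = [pmf_cdf_ratio(u, 5.0) for u in (50.0, 200.0, 600.0)]
print([round(r, 3) for r in ratios])             # increasing toward 1
print(hazard_star(600.0, 1.0, 1.0, 5.0))         # nu = 1: slowly approaching lam = 1
```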


4. Estimation of parameters 

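Let $T_1, T_2, \ldots, T_n$ be i.i.d. random failure times following CWP/E. The likelihood
function of the parameters $(\lambda, \nu, \zeta)$ is

$$L(\lambda, \nu, \zeta; T_1, \ldots, T_n) = \lambda^{n\nu}\nu^n \left(\prod_{i=1}^{n} T_i\right)^{\nu - 1} \prod_{i=1}^{n} \sum_{j=0}^{\infty} p(j; \zeta)\,p(j; (\lambda T_i)^{\nu}). \tag{27}$$

Accordingly, the minimal sufficient statistic is the trivial one $(T_{(1)}, \ldots, T_{(n)})$, where
$0 < T_{(1)} < T_{(2)} < \cdots < T_{(n)}$.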


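4.1. Moment equations estimators of $\lambda$, $\zeta$ in the homogeneous case,
$\nu = 1$.

Let $M_1 = \frac{1}{n}\sum_{i=1}^{n} T_i$ and $M_2 = \frac{1}{n}\sum_{i=1}^{n} T_i^2$ be the first two sample moments. The moment
equations estimators (MEE) of $\lambda$ and $\zeta$ are obtained by solving the equations

$$\frac{1 + \hat\zeta}{\hat\lambda} = M_1 \tag{28}$$

and

$$\frac{2 + 4\hat\zeta + \hat\zeta^2}{\hat\lambda^2} = M_2. \tag{29}$$

Or, equivalently,

$$\hat\lambda = \frac{1 + \hat\zeta}{M_1}, \tag{30}$$

and $\hat\zeta$ is the positive root of the quadratic equation

$$(M_2 - M_1^2)\,\hat\zeta^2 - 2(2M_1^2 - M_2)\,\hat\zeta - (2M_1^2 - M_2) = 0. \tag{31}$$

A real root exists provided $M_2 < 2M_1^2$. Since $2M_1^2 - M_2 \to (\zeta/\lambda)^2 > 0$ a.s. as
$n \to \infty$, an MEE exists for $n$ sufficiently large. It is given by

$$\hat\zeta = \frac{(2M_1^2 - M_2)^{1/2}\left(M_1 + (2M_1^2 - M_2)^{1/2}\right)}{M_2 - M_1^2}. \tag{32}$$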
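
The moment equations estimators are simple to compute. The sketch below follows the closed form for the positive root of the quadratic and refuses to return an estimate when $M_2 \ge 2M_1^2$; the function names are hypothetical. Feeding in the population moments for $\lambda = 2$, $\zeta = 5$ recovers the parameters exactly, which is a convenient sanity check.

```python
import math


def mee_from_moments(m1, m2):
    """Moment equations estimators (lambda_hat, zeta_hat) from M1 and M2 (nu = 1)."""
    b = 2.0 * m1 ** 2 - m2
    if b <= 0.0:
        raise ValueError("no positive root: the sample must satisfy M2 < 2 * M1**2")
    root_b = math.sqrt(b)
    zeta_hat = root_b * (m1 + root_b) / (m2 - m1 ** 2)
    lam_hat = (1.0 + zeta_hat) / m1
    return lam_hat, zeta_hat


def moment_estimators(times):
    """MEE computed from a sample of failure times."""
    n = len(times)
    m1 = sum(times) / n
    m2 = sum(t * t for t in times) / n
    return mee_from_moments(m1, m2)


# Population moments for lam = 2, zeta = 5: E T = 3 and E T^2 = 47/4.
lam_hat, zeta_hat = mee_from_moments(3.0, 11.75)
print(lam_hat, zeta_hat)
```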



Both $\hat\lambda$ and $\hat\zeta$ are strongly consistent estimators of $\lambda$ and $\zeta$, respectively. The mean
squared errors of these estimators can be approximated by the delta method. We
obtain

$$\mathrm{MSE}\{\hat\lambda\} = \frac{\lambda^2}{n} \cdot \frac{1 + 12\zeta + 58\zeta^2 + 144\zeta^3 + 192\zeta^4 + 128\zeta^5 + 32\zeta^6}{\zeta^2(1 + 2\zeta)^4} + O\!\left(\frac{1}{n^2}\right) \tag{33}$$

and

$$\mathrm{MSE}\{\hat\zeta\} = \frac{1}{n\zeta^2}\left(2(1 + \zeta)^4 - (1 + \zeta)^2 - \zeta^2\right) + O\!\left(\frac{1}{n^2}\right). \tag{34}$$

In the following table we compare the values of the MSE, as approximated by eqs.
(33) and (34), to those obtained by simulations. When $\nu = 1$ the distribution of $T$
is like that of $\chi^2[2; \zeta]/(2\lambda)$, where $\chi^2[2; \zeta]$ is a non-central chi-square with 2 degrees
of freedom and parameter of non-centrality $\zeta$. Thus

$$T \sim \left(N_1^2(\sqrt{\zeta}, 1) + N_2^2(\sqrt{\zeta}, 1)\right)/(2\lambda),$$

where $N_i(\sqrt{\zeta}, 1)$ ($i = 1, 2$) are i.i.d. normal random variables with mean $\sqrt{\zeta}$ and
variance 1. 10,000 simulation runs yield the following results.
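
The normal representation gives a convenient sampler for failure times in the homogeneous case. The sketch below draws a sample this way and compares its mean and variance with $(1 + \zeta)/\lambda$ and $(1 + 2\zeta)/\lambda^2$; the seed and sample size are arbitrary choices of ours, and the function name is hypothetical.

```python
import math
import random


def sample_failure_time(lam, zeta, rng):
    """One draw of T for nu = 1 from two normals with mean sqrt(zeta), variance 1."""
    root = math.sqrt(zeta)
    z1 = rng.gauss(root, 1.0)
    z2 = rng.gauss(root, 1.0)
    return (z1 * z1 + z2 * z2) / (2.0 * lam)


rng = random.Random(7)
lam, zeta = 1.0, 5.0
sample = [sample_failure_time(lam, zeta, rng) for _ in range(20000)]
sample_mean = sum(sample) / len(sample)
sample_var = sum((x - sample_mean) ** 2 for x in sample) / (len(sample) - 1)
print(round(sample_mean, 2), (1 + zeta) / lam)
print(round(sample_var, 2), (1 + 2 * zeta) / lam ** 2)
```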

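Table 1. MSE Values of the MEE By Delta Method and By Simulations

                           Delta Method                 Simulation
    λ    ζ    n      λ̂             ζ̂             λ̂             ζ̂
    1    5    50     0.0568        [illegible]    [illegible]    [illegible]
    1    5    100    0.0284        [illegible]    [illegible]    [illegible]
    2    [illegible]    50     [illegible]    [illegible]    [illegible]    [illegible]
    2    [illegible]    100    [illegible]    [illegible]    [illegible]    [illegible]
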

We notice that the delta method for samples of size 50 or 100 is not sufficiently
accurate. It yields values which are significantly smaller than those of the simulation.
Also, since the MEE $\hat\lambda$ and $\hat\zeta$ are continuously differentiable functions of the sample
moments $M_1$ and $M_2$, the asymptotic distributions of $\hat\lambda$ and $\hat\zeta$ are normal, with
means $\lambda$ and $\zeta$ and variances given by (33) and (34).
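
The leading terms of the delta-method approximations are easy to tabulate. The sketch below uses the two expressions as we read them from the (partly damaged) displays, namely $\lambda^2(1 + 12\zeta + \cdots + 32\zeta^6)/(n\zeta^2(1 + 2\zeta)^4)$ for $\hat\lambda$ and $(2(1 + \zeta)^4 - (1 + \zeta)^2 - \zeta^2)/(n\zeta^2)$ for $\hat\zeta$; with $\lambda = 1$, $\zeta = 5$ the first reproduces the surviving table entries 0.0568 and 0.0284. The function names are ours.

```python
def mse_lambda_delta(lam, zeta, n):
    """Leading delta-method term for the MSE of the MEE of lambda."""
    poly = (1 + 12 * zeta + 58 * zeta ** 2 + 144 * zeta ** 3
            + 192 * zeta ** 4 + 128 * zeta ** 5 + 32 * zeta ** 6)
    return lam ** 2 * poly / (n * zeta ** 2 * (1 + 2 * zeta) ** 4)


def mse_zeta_delta(zeta, n):
    """Leading delta-method term for the MSE of the MEE of zeta."""
    return (2 * (1 + zeta) ** 4 - (1 + zeta) ** 2 - zeta ** 2) / (n * zeta ** 2)


for n in (50, 100):
    print(n, round(mse_lambda_delta(1.0, 5.0, n), 4), round(mse_zeta_delta(5.0, n), 4))
```

The polynomial in the first expression factors as $(1 + 4\zeta + 2\zeta^2)(1 + 2\zeta)^4$, so the leading term for $\hat\lambda$ simplifies to $\lambda^2(1 + 4\zeta + 2\zeta^2)/(n\zeta^2)$.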


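4.2. Maximum likelihood estimators, $\nu = 1$

The log-likelihood function of $(\lambda, \zeta)$, given $T^{(n)}$, is

$$l(\lambda, \zeta; T^{(n)}) = n\log\lambda + \sum_{i=1}^{n} \log E_{\zeta}\{p(J; \lambda T_i)\}, \tag{35}$$

where $J \sim \mathrm{Pois}(\zeta)$. Accordingly, the score functions are

$$\frac{\partial}{\partial\lambda}\,l(\lambda, \zeta; T^{(n)}) = \frac{n}{\lambda} - \sum_{i=1}^{n} T_i + \zeta\sum_{i=1}^{n} T_i\,W(\lambda, \zeta, T_i) \tag{36}$$

and

$$\frac{\partial}{\partial\zeta}\,l(\lambda, \zeta; T^{(n)}) = -n + \lambda\sum_{i=1}^{n} T_i\,W(\lambda, \zeta, T_i), \tag{37}$$

where

$$W(\lambda, \zeta, T) = \frac{E_{\zeta}\left\{\frac{1}{1 + J}\,p(J; \lambda T)\right\}}{E_{\zeta}\{p(J; \lambda T)\}}. \tag{38}$$
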


Let $\hat\lambda$ and $\hat\zeta$ be the maximum likelihood estimators (MLE) of $\lambda$ and $\zeta$,
respectively.

From (36) and (37) we obtain that, as in (30),

$$\hat\lambda = \frac{1 + \hat\zeta}{M_1}. \tag{39}$$

Substituting $\hat\lambda$ in (37) we obtain the function

$$I(\zeta) = (1 + \zeta)\sum_{i=1}^{n} U_i\,W\!\left(\frac{1 + \zeta}{M_1}, \zeta, M_1 U_i\right) - n, \tag{40}$$

where $U_i = T_i/M_1$. More specifically,

$$I(\zeta) = (1 + \zeta)\sum_{i=1}^{n} U_i\,\frac{E_{\zeta}\left\{\frac{1}{1 + J}\,p(J; (1 + \zeta)U_i)\right\}}{E_{\zeta}\{p(J; (1 + \zeta)U_i)\}} - n. \tag{41}$$

Notice that $I(0) = 0$. The MLE of $\zeta$ is the positive root of $I(\zeta) = 0$. $N = 1{,}000$
simulation runs gave the following estimates of the MSE of $\hat\lambda$ and $\hat\zeta$, when $\lambda = 1$,
$\zeta = 5$ and $n = 50$, namely:

$$\mathrm{MSE}(\hat\lambda) = 0.06015 \quad\text{and}\quad \mathrm{MSE}(\hat\zeta) = 2.13027.$$
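
The estimating function $I(\zeta)$ can be coded compactly by noting that both expectations in the ratio $W$ share the factor $e^{-\zeta - \omega}$, so only the series with terms $(\zeta\omega)^j/(j!)^2$ is needed. The sketch below is one way to do this; the names, stopping rule and the small data set are hypothetical and serve only to illustrate the property $I(0) = 0$. A standard one-dimensional root finder could then be applied to locate the positive root.

```python
def w_ratio(zeta, omega, tol=1e-16, max_terms=10000):
    """W = E{p(J; omega) / (1 + J)} / E{p(J; omega)} for J ~ Pois(zeta)."""
    # The common factor exp(-zeta - omega) cancels; term_j = (zeta * omega)^j / (j!)^2.
    x = zeta * omega
    term = 1.0
    num = den = 0.0
    for j in range(max_terms):
        num += term / (j + 1)
        den += term
        term *= x / ((j + 1) ** 2)
        if term < tol * den:
            break
    return num / den


def mle_estimating_function(zeta, times):
    """I(zeta): the zeta-score after substituting lambda = (1 + zeta) / M1."""
    n = len(times)
    m1 = sum(times) / n
    total = 0.0
    for t in times:
        u = t / m1
        total += u * w_ratio(zeta, (1.0 + zeta) * u)
    return (1.0 + zeta) * total - n


times = [0.5, 1.2, 3.0, 7.5]   # hypothetical failure times, for illustration only
print(mle_estimating_function(0.0, times))
print(mle_estimating_function(2.0, times))
```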


As expected, these estimates of the MSE of $\hat\lambda$ and $\hat\zeta$ are smaller than those of the
MEE estimates, given in Table 1. The asymptotic distribution of the MLE vector
$(\hat\lambda, \hat\zeta)$ is bivariate normal with mean $(\lambda, \zeta)$ and covariance matrix $AV$, which is
the inverse of the Fisher information matrix. The asymptotic variance-covariance
matrix of the MLE can be estimated by simulation. $N = 10,000$ simulation runs
gave, for the case of $\lambda = 1$, $\zeta = 5$, the asymptotic variance-covariance matrix

$$AV = \frac{1}{n} \begin{pmatrix} 2.33917 & 13.04000 \\ 13.04000 & 83.30706 \end{pmatrix}.$$

Thus, the asymptotic variance of $\hat\zeta$ for $n = 50$ is $AV(\hat\zeta) = 83.30706/50 = 1.66614$. We
see that the estimated variance of $\hat\zeta$ is, as in the case of the MEE, considerably
larger than its asymptotic variance. The convergence is apparently very slow.


Conversations with Herman Rubin

Abstract: Herman Rubin was born October 27, 1926 in Chicago, Illinois. He
obtained his Ph.D. in Mathematics from the University of Chicago in 1948 at 
the age of 21. He has been on the faculty of Stanford University, the Univer- 
sity of Oregon, Michigan State University and Purdue University, where he is 
currently Professor of Statistics and Professor of Mathematics. He is a Fellow 
of the Institute of Mathematical Statistics and of the American Association 
for the Advancement of Science as well as a member of Sigma Xi. 

He is well known for his broad ranging mathematical research interests 
and for fundamental contributions in Bayesian decision theory, in set theory, 
in estimations for simultaneous equations, in probability and in asymptotic
statistics. 

These conversations took place during the 2003-2004 academic year at 
Purdue University. 


Herman, it is great that the IMS is bringing out this Festschrift for you. I 
am delighted to be able to prepare this interview with you. I guess we al- 
ways want to know about childhood. So Herman, where did you grow up? 

I was born in Chicago, Illinois, and grew up there, the oldest of three children.
Both of my parents were immigrants, my father from Russia and my mother from
Russian-occupied Poland. My mother’s sister was also an immigrant and she taught
me to read at the age of three.

What was your educational background? Did you receive special training 
in mathematics? 

I went to the Chicago public schools for grammar school and was a voracious 
reader in the public library. But the material was organized by grade level and I 
did not find much on mathematics beyond arithmetic. But the summer before I 
went to high school I discovered algebra when I came upon a book about it while 
visiting New York City. After reading the book, I tested out of algebra in the first 
month of the first year of high school. In high school I found many more advanced 
books about mathematics in the public library; I taught myself material through 
calculus while taking plane geometry in high school. After two years at the public 
high schools, I was given a scholarship to a combined high school/college program
at the University of Chicago. I could have graduated high school after a total of
three years but delayed the official high school graduation by one year because I
could take more college courses and not pay the college tuition. I received the high
school diploma in June of 1943, the bachelor’s degree SB (Mathematics major with
Physics minor) in December of 1944 and the master’s degree SM (Mathematics) in
March of 1945. At the University of Chicago, almost all of my courses beyond the
bachelor’s were in abstract mathematics but my Ph.D. dissertation was in statistics. 

How did you get interested in the field of statistics if most of your courses 
were in abstract mathematics? 

My interest developed during my stint at the Cowles Commission for Research
in Economics (CCRE) which was housed at the University of Chicago. In 1944
CCRE needed a mathematics research assistant because their current assistant
was being drafted into the U.S. military. At the time I was a student in the un-
dergraduate/graduate program at the University of Chicago and, aside from my
mathematical abilities, one of my qualifications was that I was too young to be
drafted. So in July of 1944 at the age of seventeen I became a research assistant for
CCRE.

I became interested in statistics because the leader of CCRE, Jacob Marschak,
who took over in 1943, had decided to concentrate the work of the group on the 
problems of stochastic simultaneous equations found in economics. 

Who worked with you when you joined CCRE? 

My initial work was with Tjalling Koopmans who had joined CCRE at the same 
time as I. He was brought in to concentrate on the mathematical aspects. My first 
paper was a solution to a problem of Koopmans for the approximate distribution of 
the circular serial correlation coefficients under the null hypothesis and it appeared 
in the Annals of Mathematical Statistics in 1945. The main problem I worked on 
with Koopmans was to estimate the parameters of a system of stochastic equations 
including lags and to derive their properties. (Individual equations might have more 
than one dependent variable and least squares was already known to be inconsistent
when applied individually to each equation.) I developed some Maximum Likelihood 
techniques and their properties for the time series lags to attack the problem. 

I understand that the work at CCRE was interrupted. 

The work was interrupted because I was drafted into the U.S. Army in March,
1945, at the age of 18. The bulk of the work I mentioned was published as a joint
paper by myself with Koopmans and Roy Leipnik. (Roy was a research assistant in
CCRE from February, 1945, to July, 1946, and took over the work with Koopmans
after I was drafted.)

I was discharged from the Army in December, 1945, and returned to the Univer-
sity of Chicago as a graduate student and CCRE as a research assistant in January,
1946. (CCRE promoted me to research associate in November, 1946.)

Who worked with you on your return to CCRE? 

I began to work with Theodore W. Anderson who had joined the CCRE as 
a research associate in November, 1945, in my absence. One source of inspiration
for our work was a talk I heard after my return given by the biologist Sewall
Wright. (He had given a general formulation for the problem of solving simultaneous
stochastic equations in 1919.) I realized that factor analysis was another example
of simultaneous stochastic equations and this led to a paper on it with Anderson.

Anderson and I collaborated on three papers. The first paper developed the 
maximum likelihood estimator of the coefficients of a single equation in a system 
of stochastic equations; the estimator is now known as the Limited Information 
Maximum Likelihood (LIML) estimator. The second paper developed the large- 
sample distribution theory. The LIML estimator had been developed in Anderson’s 
1945 dissertation. Our third joint paper developed maximum likelihood methods 
for factor analysis models with different identification conditions. It was a pretty 
innovative paper at the time. 

Another source of interesting questions was Meyer A. Girshick. Early in 1946
Koopmans gave me a letter from Girshick about the problem of estimating a sin-
gle equation (with more than one dependent variable) without estimating the en-
tire complete system of equations. (A system of equations is complete if there are
enough equations of the right sort so that all the coefficients could be consistently
estimated, essentially a multivariate regression problem.) I developed it somewhat
and then collaborated on further aspects with T. W. Anderson. This work (with
credit to Girshick) appeared finally in 1949 and 1950 in the Annals of Mathematical
Statistics. The publication was somewhat delayed because in those days it was a
major job (without the benefit of email) to communicate with the referees and my
coauthor Anderson who was in Sweden during the 1947-1948 academic year. (He
left CCRE in September, 1946, to go to Columbia.)

What about the Ph.D. degree? 

I received the Ph.D. degree from the University of Chicago in March, 1948, at the 
age of 21. My official advisor was Paul Halmos in the Department of Mathematics. 
The dissertation topic grew out of my work at CCRE. It involved extending the 
original problem of Girshick of estimating a single equation to that of estimating a 
subsystem of equations without estimating the entire complete system of equations. 
The dissertation was typed up while I was on leave from CCRE as a post-doc at the 
Institute for Advanced Study in Princeton during the academic year of 1947-1948. 

You have made major contributions to the field of asymptotics. Why do 
you feel asymptotics are important? 

The need for asymptotics at CCRE inspired me and this culminated in my first 
major insights in 1949. Some of my contributions to the problem were the asymp- 
totic theorems on limiting distributions which were never published. I introduced 
the idea of a random function into the generalization of the Slutsky Theorems.
James Hannan and Vaclav Fabian gave the proofs in their book crediting me. For
inspiration I used general topology (although metric topology is adequate). For me, 
the more I generalize the problem to an area of abstract mathematics the easier it 
is for me to understand it since I can get rid of the part which doesn’t add to the 
meaning of the problem. Even when I computed something, if I could generalize it, 
then it led to insight. I know that is not how most people like to do mathematics.

You have had an abiding interest in computing. What was it like then? 

At CCRE, computations were done with electromechanical desk calculators and 
a staff of three operated the calculators. Computations BC (Before Computers) 
were much slower. I was in charge of computing at CCRE until Herman Chernoff 
took over when I left for Princeton in August of 1947; we had some pretty funny 
experiences making the equipment work. (He had come there as a research assistant 
to CCRE in July, 1947.)

You are well known for your interests in statistical decision theory. Was 
it influenced by the CCRE experience? 

The CCRE emphasis on economics was a factor. The idea of a utility scale for 
actions assuming that the state of nature is fully known, which goes back much 
farther, was important in quantitative economics for a long time. However, no es-
sential progress had been made in getting a clear scale until the von Neumann-
Morgenstern axioms for cardinal utility appeared in their book “Theory of Games
and Economic Behavior” in 1944. One of their key contributions was the use of ran- 
domization. Researchers at the CCRE in 1947 were considering extending the ideas 
to unknown states of nature while I was there. I observed that adding one simple 
axiom made the utility for unknown states of nature a positive linear functional of 
the utility functions indexed by the given states of nature. (This is essentially the
prior Bayes approach.) 

In the early years of decision theory, the main progress was made in proving 
theorems and refining the concepts, and I had my share in this. Stanford was a
center of activity in this and I went there after leaving CCRE. I had various de-
grees of collaboration with Blackwell, Girshick, Karlin, and Chernoff, and numerous
discussions with Stein. Four dissertations on decision theory were written under me.

You have worked on a variety of problems in probability, particularly sto- 
chastic integration, characterizations and infinite divisibility. You have 
collaborated with numerous people on these. How was that experi- 
ence? 

Yes, I have collaborated with C. R. Rao, Burgess Davis, Tom Sellke, Anirban
DasGupta, Steve Samuels, Prem Puri, Rick Vitale and many others on questions in
probability. The results with C. R. Rao got to be known as Rao-Rubin theorems;
we were both visiting Stanford that year. Burgess Davis and Jeesen Chen asked an
interesting question about uniform empirical processes. Tom Sellke and I worked
on several Choquet type decomposition problems in the eighties. I have always
enjoyed using characteristic functions as a tool, as those works did. I am glad my
book length review with Arup Bose and Anirban DasGupta on infinite divisibility
got published a couple of years ago; we worked many years on that one. With Prem
Puri and Steve Samuels, the works were more in applied probability, but they were
good problems. And, you mention stochastic integration. Yes, I too had thought of
the Stratonovich integral. I gave a talk introducing the idea behind the Stratonovich
integral at the IMS meeting in Seattle in 1956. My Ph.D. student Don Fisk later
wrote a thesis on it in 1961. I myself did not write it up or pursue it formally.
Probability questions are always interesting.

In the fifties, you collaborated with Karlin on introducing monotone 
likelihood ratio. This has had a very major impact. How did that idea 
originate? 

Steve Allen, a Ph.D. student under Girshick, had come up with a proof that
in the exponential family, monotone procedures are essentially complete for one-
sided testing problems. I first wrote a technical report. Karlin and I realized it
works for monotone likelihood ratio. We generalized that result of Allen and gave
applications. Yes, it later led to concepts such as total positivity. Karlin has written
much about it.

What did you do with Chernoff? 

That was the beginning of my interest in the discontinuous density problems; 
we had a paper together in the third Berkeley symposium. But the relationship 
extended beyond professional collaboration. 

You have a number of publications in set theory. How did this interest 
arise? 

I was always interested in set theory and while in graduate school at Chicago I
took a course from the topologist John L. Kelley which piqued it even more. There
is a version of set theory that he showed me (called the Morse-Kelley set theory)
which is stronger than the usual set theory because you can prove the consistency
of the usual set theories (such as the Zermelo-Fraenkel or the von Neumann-Bernays-
Gödel) in the Morse-Kelley system.

From CCRE I went to Stanford’s Department of Statistics in 1949 as an Assis-
tant Professor and eventually met Jean Hirsh when she arrived later as a mathe-
matics Ph.D. graduate student in logic there. We married in 1952. Her interests in
logic and mine in set theory eventually led to a professional collaboration. 

Later Pat Suppes was teaching a class on set theory for which I gave some
lectures on the axiom of choice. Professor Suppes who knew both of us suggested
that Jean and I write a book on the various equivalents of the axiom of choice.
(Jean received the Ph.D. in mathematics for her work in logic in 1955 and Suppes
was her advisor.) After at least eight years, two moves and two children, we finally
finished the book.

With two parents with Ph.D.’s in mathematics, were the children also 
interested in mathematics? 

Arthur was the oldest (born in 1956) and went on to get a Ph.D. in mathematics
from California Institute of Technology (at the age of 22) after being a Putnam
Fellow four times. Arthur and Paul Erdős wrote a paper together. Leonore who was
born in 1958 received a bachelor’s degree with honors jointly in mathematics and
chemistry from Michigan State University and went on to get a Ph.D. in chemistry
from Carnegie Mellon. 

You mentioned several moves. Where did you go? 

After I left Stanford in 1955, I went to the Department of Mathematics at
the University of Oregon for four years. (Because of nepotism rules, Jean was not
allowed to have a regular position at the same university or even paid by the State
of Oregon.) I had some collaborations with Howard Tucker and A.T. Bharucha-
Reid at Oregon. From Oregon I went to Michigan State University’s Department of
Statistics in 1959. Again, Jean could not be hired because of nepotism rules. The
set theory book on the axiom of choice by Jean and me was published while we
were at Michigan State. 

Most of my collaborations at Michigan State were with Martin Fox in decision 
theory, game theory and functions of Markov states. It was also the start of a col-
laboration with J. Sethuraman. We did some work on what is now called moderate
deviations. We also later collaborated on Bayes risk efficiency. 

Then in 1967 we both came to Purdue where Jean received an offer from the 
Department of Mathematics that included tenure. I joined the Department of Sta-
tistics and the Department of Mathematics as a full Professor and Jean joined
Math as an assistant professor. I have been here ever since and my wife Jean was
a full Professor of Mathematics here until her death in 2002. She is honored by an
annual seminar and remembered for her support for women faculty in academia.
She started a scholarship fund for mathematics students in her will.

One of your strong ongoing interests is in prior Bayesian robustness. 
How do you describe it? 

One of the difficulties of Bayesian analysis is coming up with a good prior and 
loss function. (I have been saying for years that the prior and the loss cannot be 
separated. The Carnegie Mellon school is doing some work on that now.) When
I talk about prior Bayesian robustness I assume that one does not yet see the
random observation X whose distribution depends on the unknown state of nature.
One considers the choice of different priors for which one averages over the possible 
states of nature and over the possible random observations. This is different from 
posterior Bayesian robustness in which one considers the choice of different priors 
given the random observation X whose distribution depends on the unknown state
of nature. If you can get posterior Bayesian robustness, then you automatically get 
prior Bayesian robustness but seldom are we so lucky as to find posterior Bayesian 
robustness. It is actually the axioms of utility that decree we should worry about
prior Bayesian robustness. When I am faced with a choice among priors, all of which
seem about the same to me, then I am very concerned about the possible alternative
consequences of applying either one if it is drastically wrong. For instance, suppose I
am using squared error loss to estimate the mean of a normal random variable with
variance one. The first prior for the unknown mean might be normal with mean 
zero and standard deviation 10 while a second prior for the unknown mean might 
be normal with mean zero and standard deviation 1000. Now using the first prior 
could be disastrous (in terms of a loss that is averaged over the values of the state 
of nature as well as the possible mean values) when the second prior is appropriate.
Yet using the second prior would not be so bad if the first prior were appropriate. 
In contrast for the posterior Bayesian robustness approach, if the observation X is 
large, then the posterior loss is bad in either case if the wrong prior is used. 
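
The asymmetry between the two priors can be checked with a short calculation. The sketch below is an illustration added here, not part of the conversation. It assumes a single observation X ~ N(θ, 1), squared error loss, and the usual Bayes rule for a N(0, a²) prior, which shrinks X by the factor c = a²/(1 + a²); averaging the loss over both X and a true prior N(0, b²) gives the integrated risk c² + (1 − c)² b². The function name is hypothetical.

```python
def integrated_risk(assumed_sd, true_sd):
    """Prior-averaged squared error of the Bayes rule built from assumed_sd.

    Assumes X | theta ~ N(theta, 1) and theta ~ N(0, true_sd**2); the rule in
    use is c * X with c = assumed_sd**2 / (1 + assumed_sd**2).
    """
    c = assumed_sd ** 2 / (1.0 + assumed_sd ** 2)
    return c ** 2 + (1.0 - c) ** 2 * true_sd ** 2


# Using the sd-10 prior when the sd-1000 prior is appropriate, and vice versa,
# each compared with the risk of the rule built from the correct prior.
wrong_narrow = integrated_risk(10.0, 1000.0)
right_wide = integrated_risk(1000.0, 1000.0)
wrong_wide = integrated_risk(1000.0, 10.0)
right_narrow = integrated_risk(10.0, 10.0)
```

Under these assumptions the narrow prior used wrongly costs about 99 against a best attainable value of about 1, while the wide prior used wrongly costs about 1.00 against about 0.99, which is the asymmetry described in the answer above.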

You have had a long term interest in random number generation. What 
inspired you? 

I heard that a professor at Columbia had announced to his class that he would 
give a midterm in each of five three-week periods, the particular week to be chosen 
at random by tossing a coin. Finding an efficient way to do this was an interesting 
problem to me. I observed that generating all five results at once was far more 
efficient than generating the results one at a time from a discrete distribution. This
eventually led to less trivial questions and was the start of my interest in efficient 
methods for generating random numbers. 
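
The efficiency gain can be made concrete with a small sketch. It is an illustration only and assumes a particular scheme that the conversation does not spell out: fair coin tosses are read as binary digits and out-of-range values are rejected. Choosing one week out of three separately for each of the five periods uses two tosses per attempt; choosing all five at once uses eight tosses to pick one of 3^5 = 243 equally likely outcomes. All names are hypothetical.

```python
import math
import random


def draw_one_at_a_time(rng):
    """Five separate rejection draws, two coin tosses per attempt."""
    weeks, tosses = [], 0
    while len(weeks) < 5:
        value = rng.getrandbits(2)  # two tosses give a value in 0..3
        tosses += 2
        if value < 3:               # reject 3, keep weeks 0, 1, 2
            weeks.append(value)
    return weeks, tosses


def draw_all_at_once(rng):
    """One rejection draw on 0..242, decoded into five base-3 digits."""
    tosses = 0
    while True:
        value = rng.getrandbits(8)  # eight tosses give a value in 0..255
        tosses += 8
        if value < 243:
            break
    weeks = []
    for _ in range(5):
        value, digit = divmod(value, 3)
        weeks.append(digit)
    return weeks, tosses


# Expected numbers of tosses, and the entropy lower bound 5 * log2(3).
EXPECTED_SEPARATE = 5 * 2 / (3 / 4)
EXPECTED_JOINT = 8 / (243 / 256)
ENTROPY_BOUND = 5 * math.log2(3)

rng = random.Random(0)
separate_weeks, separate_tosses = draw_one_at_a_time(rng)
joint_weeks, joint_tosses = draw_all_at_once(rng)
```

Under this scheme the joint draw needs 2048/243 tosses on average against 40/3 for the separate draws, and no method can go below the entropy bound.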

The two main problems I find interesting are the following: how to get lots of 
random numbers which are independent and uniform; and how to turn them into 
independent random numbers from some other distributions. 

In the case of the first main problem, when generating independent uniform 
random variables, most people use pseudo random numbers. It is almost impossible
to prove that they have all the desired properties. Of course, they fail the test 
that they come from the pseudo random number generator! For physical random 
numbers, one can question the accuracy of the model for the physical process. (I 
have a technical report about paradoxes caused by the effect of dead time.) My
personal preference is to use a stream of physical random numbers and a stream of
pseudo random numbers to produce a stream of random numbers whose qualities 
should be at least as good as either of the original two streams. 
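
One common way to combine two such streams is a bitwise exclusive-or; the conversation does not say which combination is meant, so the choice of XOR here is an assumption. If the two streams are independent and either one consists of independent uniform bits, so does their XOR. In this sketch `os.urandom` is only a stand-in for a physical source, and the function names are hypothetical.

```python
import os
import random


def combine_streams(physical, pseudo):
    """Bitwise XOR of two equal-length byte streams."""
    if len(physical) != len(pseudo):
        raise ValueError("streams must have the same length")
    return bytes(a ^ b for a, b in zip(physical, pseudo))


def pseudo_bytes(seed, n):
    """n bytes from Python's built-in pseudo random generator."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


# os.urandom is NOT a physical device; it merely plays that role here.
physical = os.urandom(16)
mixed = combine_streams(physical, pseudo_bytes(2004, 16))
```

The design point is that a defect in one stream, such as dead-time effects in a counter or hidden structure in a generator, is masked as long as the other stream is sound and the two are independent.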

In the case of the second main problem, even when you have independent uni- 
form random variables, the problem of using them to generate variables from other 
distributions is sometimes hard. The basic issue of efficiency is not the question of 
the number of bits used but rather the computational cost... and this is a complex
question. I have some technical reports on these issues. 

Computing issues in probability and statistics have been a topic of re- 
search for you, too. What are your comments? 

Computation is an obvious issue in the generation of random numbers. But 
computation of probabilities is also important and it is often best done through 
the use of characteristic functions (i.e. Fourier transforms). I find that reasonably 
efficient computational procedures require complex integration... and that requires
a knowledge of analytic functions. 
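
As a small illustration of computing a probability from a characteristic function, the sketch below uses the Gil-Pelaez inversion formula F(x) = 1/2 − (1/π) ∫₀^∞ Im[e^{−itx} φ(t)]/t dt with a plain midpoint rule on the real line. This is a deliberately simple stand-in and not the complex-integration procedures referred to above; the truncation point, step count, and function names are assumptions made for the example.

```python
import cmath
import math


def cdf_from_cf(cf, x, upper=12.0, steps=12000):
    """Approximate F(x) from the characteristic function cf (Gil-Pelaez)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h  # midpoints avoid the removable singularity at t = 0
        total += (cmath.exp(-1j * t * x) * cf(t)).imag / t
    return 0.5 - total * h / math.pi


def normal_cf(t):
    """Characteristic function of the standard normal distribution."""
    return cmath.exp(-0.5 * t * t)


approx = cdf_from_cf(normal_cf, 1.96)
exact = 0.5 * (1.0 + math.erf(1.96 / math.sqrt(2.0)))
```

For the normal case the two values can be compared directly; for distributions known only through their characteristic functions, the truncation point and step size would need to be chosen from the decay of φ.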

Another important area is the computation of Maximum Likelihood estimates
(MLE). This typically requires more expertise in analysis than is usually expected
in most statistics graduate programs. For Bayes procedures, integration computing
problems are commonplace. Many have pointed out the difficulties in posterior
Bayes computations. Simulation is another area of computing problems. In 1970, I
was using simulation to compute theoretical expectations for Kolmogorov-Smirnov
and Kuiper statistics under nonnull hypotheses. The finite sample distribution was
approximated by a modification of a Brownian Bridge. My first observation was
that I could not just simulate at a finite number of points because the Brownian
Bridge changed too rapidly. This was handled by simulating the max and the min in
various intervals independently (even though the max and min are not independent)
and it worked quite well. The reason it worked well was that the probability that
the max and min of the whole process are both in the same interval is extremely
small. This points out that it is often necessary to do analysis before numerical
analysis.
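
The idea of simulating interval extremes can be sketched for a standard Brownian bridge; the modified bridge used in the 1970 work is not described here, so this is an illustration under stated assumptions rather than a reconstruction. Given the values a and b at the ends of an interval of length h, the maximum M in between satisfies P(M > m) = exp(−2(m − a)(m − b)/h) for m ≥ max(a, b), which can be inverted in closed form; the minimum follows by symmetry. As in the text, the max and min within an interval are drawn independently. All names are hypothetical.

```python
import math
import random


def bridge_on_grid(n_intervals, rng):
    """Standard Brownian bridge on [0, 1] at the points k / n_intervals."""
    h = 1.0 / n_intervals
    walk = [0.0]
    for _ in range(n_intervals):
        walk.append(walk[-1] + rng.gauss(0.0, math.sqrt(h)))
    # Tie the random walk down at t = 1.
    return [w - (k * h) * walk[-1] for k, w in enumerate(walk)]


def interval_extreme(a, b, h, rng, sign):
    """Max (sign=+1) or min (sign=-1) between grid values a and b."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
    spread = math.sqrt((a - b) ** 2 - 2.0 * h * math.log(u))
    return 0.5 * (a + b + sign * spread)


def simulated_max_min(n_intervals, rng):
    """Grid values plus the overall max and min including within-interval extremes."""
    h = 1.0 / n_intervals
    grid = bridge_on_grid(n_intervals, rng)
    pairs = list(zip(grid, grid[1:]))
    highest = max(interval_extreme(a, b, h, rng, +1) for a, b in pairs)
    lowest = min(interval_extreme(a, b, h, rng, -1) for a, b in pairs)
    return grid, highest, lowest


grid, highest, lowest = simulated_max_min(20, random.Random(3))
```

The simulated maximum is never below the largest grid value, which is exactly the correction that looking at grid points alone would miss.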

How has the internet affected your work? 

I contribute to newsgroups and give advice to those that ask. I also join extended 
discussions on how things should be done. In general, I find out things that I might
not otherwise know because the internet puts me in contact with lots of people. It 
is much easier to collaborate and I think it is a great advantage for research. You 
see, you can put a hard problem on the web and get help from experts. 

I have not yet directly engaged in internet communication in producing papers, 
but I have been a joint author with local collaborators who have. It would help if 
we could have an easier way to communicate mathematics notation. 

What are you working on these days? 

Hui Xu, a Ph.D. student, is doing some work with me on density estimation
when there are discontinuities. I have had a longstanding interest in that. I am 
also doing some work on random number generation with Brad Johnson, another 
Ph.D. student. And I just finished a paper on the Binomial n problem with Anirban 
DasGupta; it is coming out early next year in the Chernoff Festschrift! 

What else do you enjoy? Have you done much traveling? 

Well, I enjoy going to the concerts and the operas, although I did not so much as 
a child. I try to keep track of what is going on in mathematics. Previously, I could 
only do it by picking up the journals at the library. Now you can do some of it by
using the net. I think what I enjoy the most is talking to students and people and
being of any help that I can. You see, I have an open door policy.

For me, the most rewarding part of traveling is talking to people about interest- 
ing questions. I enjoyed going to the International Congress of Mathematicians at 
Stockholm, the Oberwolfach meeting I went to, and a meeting at Israel. I went to 
the ISI in 1974. Mahalanobis had just passed away, but there were a lot of people 
from everywhere at that meeting. I remember Persi Diaconis being there and Peter 
Bickel and many others. Urbanik was eating the raw jalapenos for his snacks. But 
even the symposium food was too spicy for me. From Calcutta, I went to Delhi. B. 
K. Kale invited me to come to Jaipur. It was an interesting trip. I went by a private 
limousine and returned on a public bus! Anirban wants to take me back there. We 
will see. 

Have your many years of teaching influenced your ideas about statistical 
education? 

Definitely. I believe that it is the unusual person who can go easily from the 
specific to the abstract. I think it is easier to go from the abstract to the specific. 
(Most of my colleagues disagree with me on this.) I have no objection to using 
examples after a concept. But going from special cases to the general still leaves
the need for unlearning, which is difficult. 

Because theorems and proofs are an important part of mathematical statistics,
I believe that students who did not have some kind of course with theorems and
proofs in high school, say Euclid-type geometry, flounder when they reach math-
ematics in college. We must improve the quality of mathematics education in the US.
Competition is getting very strong and economic health is directly related to quality
of education.

But even more important than experience with theorems and proofs are courses 
that emphasize concepts. For instance, thinking about integration as a limit of a sum 
is a crucial idea in statistics, especially for expectations. Students have difficulty if 
they learn integration as antidifferentiation, i.e. the “opposite of differentiation,” 
and not as a summing process. I believe that it is possible for students to learn 
concepts directly if properly explained. This does not mean that a student will be 
able to use the concept upon hearing the words. Considerable learning may need 
to occur before the “light bulb” goes on.

To close the interview, what do you have to say about the future of 
statistics? 

The biggest opportunities lie in the development of decision theoretic approaches 
to the problems of individual users where one considers ALL the consequences of
the proposed solutions. Taking all the consequences into consideration can produce
very difficult mathematical problems and provides great opportunities for those 
with mathematical expertise. 

This is in contrast to the emphasis today on the development of general recipes 
that are used for solving problems and that are often used inappropriately. The 
latter two-thirds of the nineteenth century saw a similar emphasis. The turnaround
came after World War II with people going into statistics from good mathematics
programs who could attack the challenging mathematical problems. Before the
turnaround there was also a rush by users, as now, to use statistical methods without
understanding the assumptions and their consequences. I feel it is the user who must
make the assumptions rather than just the statistician! Arguably, in a quantitative 
area, the user is not well prepared to do that. 

Both those becoming statisticians and the users need to realize that there are
underlying concepts for the field and they must use an understanding of the concepts
rather than a catalog of methods. Just knowing how to compute does not help,
and even being able only to prove lots of theorems would not. We CAN teach these
concepts, and many of them even at fairly low-level courses. The applied statistician
needs to be able in many cases to invent new methods on the spot. There will be
great opportunities for collaboration between applied scientists and mathematicians
in the coming years. I hope neither ignores the other as an ancillary. That will be 
a mistake. 

Thank you for your interesting views on research and teaching and for 
the interesting stories on your life. It was a pleasure. Good luck to you 
and we hope to continue to walk through the door and ask a question 
and get help. You have been a gracious resource to all of us. I wish you 
health and happiness. 

You are very welcome.